COMPUTATIONAL 
MATHEMATICS 

B. P. Demidovich 

I. A. Maron 





B. P. Demidovich
I. A. Maron
FUNDAMENTALS OF
COMPUTATIONAL
MATHEMATICS
«Nauka» Publishing House


COMPUTATIONAL MATHEMATICS 


B. P. DEMIDOVICH, I. A. MARON


TRANSLATED FROM THE RUSSIAN 
BY GEORGE YANKOVSKY 


MIR PUBLISHERS • MOSCOW


First published 1973 
Second printing 1976 
Third printing 1981 


In English


© English translation, Mir Publishers, 1981 


PREFACE 


The rapid development of computing machines and the broad- 
ening application of modern mathematical techniques to enginee- 
ring investigations have greatly enhanced demands concerning the 
mathematical training of engineers and scientific workers who deal 
with applied problems. 

The mathematical education of the investigating engineer can no 
longer be confined to the traditional departments of the so-called 
“classical analysis” which was established in its basic outlines at 
the beginning of this century. Research engineers have to know 
many areas of modern mathematics and, primarily, require a firm 
grasp of the methods and techniques of computational mathematics 
insofar as the solution of almost every engineering problem must 
be carried to a numerical result. 

Present-day computing devices have greatly extended the realm 
of computational work, making it possible, in many instances, to 
reject approximate interpretations of applied problems and pass on
to the solution of precisely stated problems. This involves the 
utilization of deeper specialized divisions of mathematics (nonlinear 
differential equations, functional analysis, probabilistic methods, 
etc.). 

Proper utilization of modern computers is impossible without 
the skilled use of methods of approximate and numerical analysis. 
All this explains the universal enhanced interest in the methods 
of computational mathematics. 

The basic aim of this book is to give as far as possible a syste- 
matic and modern presentation of the most important methods and 
techniques of computational mathematics on the basis of the general 
course of higher mathematics taught in higher technical schools.
The book has been arranged so that the basic portion constitutes
a manual for the first cycle of studies in approximate computations
for higher technical colleges. The text contains supplementary ma- 
terial which goes beyond the scope of the ordinary college course, 
but the reader can select those sections which interest him and 
omit any extra material without loss of continuity. The chapters 
and sections which may be dropped out in a first reading are 
marked with an asterisk.



This text makes wide use of matrix calculus. The concepts of 
a vector, matrix, inverse matrix, eigenvalue and eigenvector of 
a matrix, etc. are workaday tools. The use of matrices offers a num- 
ber of advantages in presenting the subject matter since they greatly 
facilitate an understanding of the development of many computa- 
tions. In this sense a particular gain is achieved in the proofs of 
the convergence theorems of various numerical processes. Also, 
modern high-speed computers are nicely adapted to the performance 
of the basic matrix operations. 

For a full comprehension of the contents of this book, the reader
should have a background of linear algebra and the theory of
linear vector spaces. With the aim of making the text as self-contained
as possible, the authors have included all the necessary
starting material in these subjects. The appropriate chapters are 
completely independent of the basic text and can be omitted by 
readers who have already studied these sections. 

A few words about the contents of the book. In the main it is 
devoted to the following problems: operations involving approximate 
numbers, computation of functions by means of series and iterative 
processes, approximate and numerical solution of algebraic and
transcendental equations, computational methods of linear algebra, 
interpolation of functions, numerical differentiation and integration 
of functions, and the Monte Carlo method. 

A great deal of attention is devoted to methods of error esti- 
mation. Nearly all processes are provided with proofs of conver- 
gence theorems, and the presentation is such that the proofs may 
be omitted if one wishes to confine himself to the technical aspects 
of the matter. In certain cases, in order to illustrate and lighten
the presentation, the computational techniques are given as simple 
recipes. 

The basic methods are carried to numerical applications that 
include computational schemes and numerical examples with de- 
tailed steps of solution. To facilitate understanding the essence of 
the matter at hand, most of the problems are stated in simple 
form and are of an illustrative nature. References are given at 
the end of each chapter and the complete list (in alphabetical
order) is given at the end of the book. 

The present text offers selected methods in computational mathe- 
matics and does not include material that involves empirical for- 
mulas, quadratic approximation of functions, approximate solutions 
of differential equations, etc. Likewise, the book does not include 
material on programming and the technical aspects of solving 
mathematical problems on computers. The interested reader must 
consult the special literature on these subjects. 

B. P. Demidovich, 
I. A. Maron 


CONTENTS 


PREFACE

INTRODUCTION. GENERAL RULES OF COMPUTATIONAL WORK

CHAPTER 1
APPROXIMATE NUMBERS
1.1 Absolute and relative errors
1.2 Basic sources of errors
1.3 Scientific notation. Significant digits. The number of correct digits
1.4 Rounding of numbers
1.5 Relationship between the relative error of an approximate number
     and the number of correct digits
1.6 Tables for determining the limiting relative error from the number
     of correct digits and vice versa
1.7 The error of a sum
1.8 The error of a difference
1.9 The error of a product
1.10 The number of correct digits in a product
1.11 The error of a quotient
1.12 The number of correct digits in a quotient
1.13 The relative error of a power
1.14 The relative error of a root
1.15 Computations in which errors are not taken into exact account
1.16 General formula for errors
1.17 The inverse problem of the theory of errors
1.18 Accuracy in the determination of arguments from a tabulated
     function
1.19 The method of bounds
*1.20 The notion of a probability error estimate
References for Chapter 1

CHAPTER 2
SOME FACTS FROM THE THEORY OF CONTINUED FRACTIONS
2.1 The definition of a continued fraction
2.2 Converting a continued fraction to a simple fraction and vice versa
2.3 Convergents
2.4 Nonterminating continued fractions
2.5 Expanding functions into continued fractions
References for Chapter 2

CHAPTER 3
COMPUTING THE VALUES OF FUNCTIONS
3.1 Computing the values of a polynomial. Horner’s scheme
3.2 The generalized Horner scheme
3.3 Computing the values of rational fractions
3.4 Approximating the sums of numerical series
3.5 Computing the values of an analytic function
3.6 Computing the values of exponential functions
3.7 Computing the values of a logarithmic function
3.8 Computing the values of trigonometric functions
3.9 Computing the values of hyperbolic functions
3.10 Using the method of iteration for approximating the values of a
     function
3.11 Computing reciprocals
3.12 Computing square roots
3.13 Computing the reciprocal of a square root
3.14 Computing cube roots
References for Chapter 3

CHAPTER 4
APPROXIMATE SOLUTIONS OF ALGEBRAIC AND TRANSCENDENTAL
EQUATIONS
4.1 Isolation of roots
4.2 Graphical solution of equations
4.3 The halving method
4.4 The method of proportional parts (method of chords)
4.5 Newton’s method (method of tangents)
4.6 Modified Newton method
4.7 Combination method
4.8 The method of iteration
4.9 The method of iteration for a system of two equations
4.10 Newton’s method for a system of two equations
4.11 Newton’s method for the case of complex roots
References for Chapter 4

CHAPTER 5
SPECIAL TECHNIQUES FOR APPROXIMATE SOLUTION OF ALGEBRAIC
EQUATIONS
5.1 General properties of algebraic equations
5.2 The bounds of real roots of algebraic equations
5.3 The method of alternating sums
5.4 Newton’s method
5.5 The number of real roots of a polynomial
5.6 The theorem of Budan-Fourier
5.7 The underlying principle of the method of Lobachevsky-Graeffe
5.8 The root-squaring process
5.9 The Lobachevsky-Graeffe method for the case of real and distinct
     roots
5.10 The Lobachevsky-Graeffe method for the case of complex roots
5.11 The case of a pair of complex roots
5.12 The case of two pairs of complex roots
5.13 Bernoulli’s method
References for Chapter 5

CHAPTER 6
ACCELERATING THE CONVERGENCE OF SERIES
6.1 Accelerating the convergence of numerical series
6.2 Accelerating the convergence of power series by the Euler-Abel
     method
6.3 Estimates of Fourier coefficients
6.4 Accelerating the convergence of Fourier trigonometric series by the
     method of A. N. Krylov
6.5 Trigonometric approximation
References for Chapter 6

CHAPTER 7
MATRIX ALGEBRA
7.1 Basic definitions
7.2 Operations involving matrices
7.3 The transpose of a matrix
7.4 The inverse matrix
7.5 Powers of a matrix
7.6 Rational functions of a matrix
7.7 The absolute value and norm of a matrix
7.8 The rank of a matrix
7.9 The limit of a matrix
7.10 Series of matrices
7.11 Partitioned matrices
7.12 Matrix inversion by partitioning
7.13 Triangular matrices
7.14 Elementary transformations of matrices
7.15 Computation of determinants
References for Chapter 7

CHAPTER 8
SOLVING SYSTEMS OF LINEAR EQUATIONS
8.1 A general description of methods of solving systems of linear
     equations
8.2 Solution by inversion of matrices. Cramer’s rule
8.3 The Gaussian method
8.4 Improving roots
8.5 The method of principal elements
8.6 Use of the Gaussian method in computing determinants
8.7 Inversion of matrices by the Gaussian method
8.8 Square-root method
8.9 The scheme of Khaletsky
8.10 The method of iteration
8.11 Reducing a linear system to a form convenient for iteration
8.12 The Seidel method
8.13 The case of a normal system
8.14 The method of relaxation
8.15 Correcting elements of an approximate inverse matrix
References for Chapter 8

*CHAPTER 9
THE CONVERGENCE OF ITERATION PROCESSES FOR SYSTEMS OF LINEAR
EQUATIONS
9.1 Sufficient conditions for the convergence of the iteration process
9.2 An estimate of the error of approximations in the iteration process
9.3 First sufficient condition for convergence of the Seidel process
9.4 Estimating the error of approximations in the Seidel process by
     the m-norm
9.5 Second sufficient condition for convergence of the Seidel process
9.6 Estimating the error of approximations in the Seidel process by
     the l-norm
9.7 Third sufficient condition for convergence of the Seidel process
References for Chapter 9

CHAPTER 10
ESSENTIALS OF THE THEORY OF LINEAR VECTOR SPACES
10.1 The concept of a linear vector space
10.2 The linear dependence of vectors
10.3 The scalar product of vectors
10.4 Orthogonal systems of vectors
10.5 Transformations of the coordinates of a vector under changes in
     the basis
10.6 Orthogonal matrices
10.7 Orthogonalization of matrices
10.8 Applying orthogonalization methods to the solution of systems of
     linear equations
10.9 The solution space of a homogeneous system
10.10 Linear transformations of variables
10.11 Inverse transformation
10.12 Eigenvectors and eigenvalues of a matrix
10.13 Similar matrices
10.14 Bilinear form of a matrix
10.15 Properties of symmetric matrices
*10.16 Properties of matrices with real elements
References for Chapter 10

*CHAPTER 11
ADDITIONAL FACTS ABOUT THE CONVERGENCE OF ITERATION
PROCESSES FOR SYSTEMS OF LINEAR EQUATIONS
11.1 The convergence of matrix power series
11.2 The Cayley-Hamilton theorem
11.3 Necessary and sufficient conditions for the convergence of the
     process of iteration for a system of linear equations
11.4 Necessary and sufficient conditions for the convergence of the
     Seidel process for a system of linear equations
11.5 Convergence of the Seidel process for a normal system
11.6 Methods for effectively checking the conditions of convergence
References for Chapter 11

CHAPTER 12
FINDING THE EIGENVALUES AND EIGENVECTORS OF A MATRIX
12.1 Introductory remarks
12.2 Expansion of secular determinants
12.3 The method of Danilevsky
12.4 Exceptional cases in the Danilevsky method
12.5 Computation of eigenvectors by the Danilevsky method
12.6 The method of Krylov
12.7 Computation of eigenvectors by the Krylov method
12.8 Leverrier’s method
12.9 On the method of undetermined coefficients
12.10 A comparison of different methods of expanding a secular
     determinant
12.11 Finding the numerically largest eigenvalue of a matrix and the
     corresponding eigenvector
12.12 The method of scalar products for finding the first eigenvalue of a
     real matrix
12.13 Finding the second eigenvalue of a matrix and the second
     eigenvector
12.14 The method of exhaustion
12.15 Finding the eigenvalues and eigenvectors of a positive definite
     symmetric matrix
12.16 Using the coefficients of the characteristic polynomial of a matrix
     for matrix inversion
12.17 The method of Lyusternik for accelerating the convergence of the
     iteration process in the solution of a system of linear equations
References for Chapter 12

CHAPTER 13
APPROXIMATE SOLUTION OF SYSTEMS OF NONLINEAR EQUATIONS
13.1 Newton’s method
13.2 General remarks on the convergence of the Newton process
*13.3 The existence of roots of a system and the convergence of the
     Newton process
*13.4 The rapidity of convergence of the Newton process
*13.5 Uniqueness of solution
*13.6 Stability of convergence of the Newton process under variations
     of the initial approximation
13.7 The modified Newton method
13.8 The method of iteration
*13.9 The notion of a contraction mapping
*13.10 First sufficient condition for the convergence of the process of
     iteration
*13.11 Second sufficient condition for the convergence of the process of
     iteration
13.12 The method of steepest descent (gradient method)
13.13 The method of steepest descent for the case of a system of linear
     equations
*13.14 The method of power series
References for Chapter 13

CHAPTER 14
THE INTERPOLATION OF FUNCTIONS
14.1 Finite differences of various orders
14.2 Difference table
14.3 Generalized power
14.4 Statement of the problem of interpolation
14.5 Newton’s first interpolation formula
14.6 Newton’s second interpolation formula
14.7 Table of central differences
14.8 Gaussian interpolation formulas
14.9 Stirling’s interpolation formula
14.10 Bessel’s interpolation formula
14.11 General description of interpolation formulas with constant interval
14.12 Lagrange’s interpolation formula
*14.13 Computing Lagrangian coefficients
14.14 Error estimate of Lagrange’s interpolation formula
14.15 Error estimates of Newton’s interpolation formulas
14.16 Error estimates of the central interpolation formulas
14.17 On the best choice of interpolation points
14.18 Divided differences
14.19 Newton’s interpolation formula for unequally spaced values of the
     argument
14.20 Inverse interpolation for the case of equally spaced points
14.21 Inverse interpolation for the case of unequally spaced points
14.22 Finding the roots of an equation by inverse interpolation
14.23 The interpolation method for expanding a secular determinant
*14.24 Interpolation of functions of two variables
*14.25 Double differences of higher order
*14.26 Newton’s interpolation formula for a function of two variables
References for Chapter 14

CHAPTER 15
APPROXIMATE DIFFERENTIATION
15.1 Statement of the problem
15.2 Formulas of approximate differentiation based on Newton’s first
     interpolation formula
15.3 Formulas of approximate differentiation based on Stirling’s formula
15.4 Formulas of numerical differentiation for equally spaced points
15.5 Graphical differentiation
*15.6 On the approximate calculation of partial derivatives
References for Chapter 15

CHAPTER 16
APPROXIMATE INTEGRATION OF FUNCTIONS
16.1 General remarks
16.2 Newton-Cotes quadrature formulas
16.3 The trapezoidal formula and its remainder term
16.4 Simpson’s formula and its remainder term
16.5 Newton-Cotes formulas of higher orders
16.6 General trapezoidal formula (trapezoidal rule)
16.7 Simpson’s general formula (parabolic rule)
16.8 On Chebyshev’s quadrature formula
16.9 Gaussian quadrature formula
16.10 Some remarks on the accuracy of quadrature formulas
*16.11 Richardson extrapolation
*16.12 Bernoulli numbers
*16.13 Euler-Maclaurin formula
16.14 Approximation of improper integrals
16.15 The method of Kantorovich for isolating singularities
16.16 Graphical integration
*16.17 On cubature formulas
*16.18 A cubature formula of Simpson type
References for Chapter 16

CHAPTER 17
THE MONTE CARLO METHOD
17.1 The idea of the Monte Carlo method
17.2 Random numbers
17.3 Ways of generating random numbers
17.4 Monte Carlo evaluation of multiple integrals
*17.5 Solving systems of linear algebraic equations by the Monte Carlo
     method

INDEX


INTRODUCTION 


GENERAL RULES OF COMPUTATIONAL WORK 


When performing computations on a large scale it is important 
to adhere to some simple rules that have evolved over the years 
and are designed to save the time and labour of the computor and 
make for more efficient use of computational machines and auxi- 
liary devices. 

The first step for the computor is to work out a computational 
scheme providing a detailed list of the order of operations and making 
it possible to achieve the desired result in the fastest and simplest 
manner. This is particularly necessary in computational operations of a 
repetitive type where a thoroughly devised scheme makes for speedy, 
reliable and automatic computations and fully compensates the time 
spent in elaborating the computational scheme. Also, a sufficiently 
detailed computational scheme can be competently handled by relati- 
vely inexperienced computors. 

To illustrate the compilation of a computational scheme, suppose 
it is required to compute the values of the analytically specified 
function 


$$y = f(x)$$


for certain values of the argument: $x = x_1, x_2, \ldots, x_n$. If the number
of these values is great, it is not advisable to compute them
separately, first $f(x_1)$, then $f(x_2)$, and so on, each time performing
the whole sequence of operations indicated by the symbol $f$. It is
much better to separate $f$ into elementary operations


$$f(x) = f_m(\ldots(f_2(f_1(x)))\ldots)$$


and carry out the computations as repeated operations:


$$u_i = f_1(x_i) \quad (i = 1, 2, \ldots, n),$$
$$v_i = f_2(u_i) \quad (i = 1, 2, \ldots, n),$$
$$\ldots$$
$$y_i = f_m(w_i) \quad (i = 1, 2, \ldots, n),$$


performing one and the same operation $f_j$ ($j = 1, 2, \ldots, m$) for all
values of the argument under consideration. Then one can make 
wide use of tables of functions and specialized computing devices. 
The results of computations should be recorded on specially designed 
computing forms (computation sheets) having appropriate divisions 
and headings (as applied to the chosen computational scheme). These 
sheets are filled in with intermediate results as they are obtained,
and with the final results. 

Computation sheets are usually designed so that the results of 
each series of repeated operations are recorded in a single column 
or row, and the general arrangement of intermediate results is con- 
venient for subsequent computations. 

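For example, in order to compile a table of the values of the
function

$$y = \frac{e^x + \cos x}{1 + x^2} + \sqrt{1 + \sin^2 x} \qquad (1)$$

a computation sheet of the type shown in Table 1 might be recommended.

TABLE 1
COMPUTATION SHEET FOR FUNCTION (1)

Column   Quantity                       Obtained as
(1)      x                              given
(2)      x^2                            (1)^2
(3)      e^x                            from tables
(4)      sin x                          from tables
(5)      cos x                          from tables
(6)      e^x + cos x                    (3) + (5)
(7)      1 + x^2                        1 + (2)
(8)      (e^x + cos x) : (1 + x^2)      (6) : (7)
(9)      sin^2 x                        (4)^2
(10)     1 + sin^2 x                    1 + (9)
(11)     sqrt(1 + sin^2 x)              square root of (10)
(12)     y                              (8) + (11)

Each numbered column of the sheet holds one quantity, with one row
for each value of x.
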

The computations are performed by column. The character of the 
repeated operations carried out is clear from the computation sheet. 

First, all values of the argument x are recorded in column (1), 
then all the numbers of column (1) are squared and recorded in 
column (2). Then, using tables, the values of $e^x$, $\sin x$, $\cos x$ are
determined in succession for each number of column (1) and are 
entered in columns (3), (4) and (5). 

The subsequent columns give the results of intermediate operations.
Say, column (6) contains the sums $e^x + \cos x$ [schematically,
(3) + (5)], etc. The values of the desired function $y$ are
entered in the last column, (12). With a properly constructed
computation sheet, the computor does not use the given formula in
his calculations since his attention is centered entirely on the
sequence of filling in the indicated columns.
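
The column-by-column scheme can be sketched in Python. This is only an illustration, assuming function (1) is y = (e^x + cos x)/(1 + x^2) + sqrt(1 + sin^2 x), which is what the column operations of Table 1 indicate. The name `computation_sheet` and the sample arguments are hypothetical. Each list plays the role of one column, and every column is filled for all arguments before the next one is started.

```python
import math

def computation_sheet(xs):
    """Fill the twelve columns of the sheet one column at a time."""
    c1 = list(xs)                              # (1)  x
    c2 = [x * x for x in c1]                   # (2)  x^2
    c3 = [math.exp(x) for x in c1]             # (3)  e^x
    c4 = [math.sin(x) for x in c1]             # (4)  sin x
    c5 = [math.cos(x) for x in c1]             # (5)  cos x
    c6 = [a + b for a, b in zip(c3, c5)]       # (6)  (3) + (5)
    c7 = [1 + a for a in c2]                   # (7)  1 + (2)
    c8 = [a / b for a, b in zip(c6, c7)]       # (8)  (6) : (7)
    c9 = [a * a for a in c4]                   # (9)  (4)^2
    c10 = [1 + a for a in c9]                  # (10) 1 + (9)
    c11 = [math.sqrt(a) for a in c10]          # (11) square root of (10)
    c12 = [a + b for a, b in zip(c8, c11)]     # (12) (8) + (11), the values of y
    return c12

# Hypothetical sample arguments.
ys = computation_sheet([0.0, 0.5, 1.0])
```

As in the hand computation, no line of the function evaluates the full formula; each step only combines columns that are already filled.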

Note that the computational scheme and the form of the com- 
putation sheet are largely dependent on the technical facilities and 
auxiliary tables at hand. For example, in certain cases the sepa- 
rate intermediate results are stored in the memory of a machine 
and are not entered in the sheet. At other times, standard sets of 
operations may be conveniently regarded as a separate operation. 
For instance, in a slide-rule computation, the numerical value of 
an expression of the form 


ab 


c 


may be computed at once without recording any intermediate re- 
sult and so there is no need to split it up into the elementary 
operations of multiplication and division. Similarly, when using
an electric desk calculator the process of finding the sum of paired
products


$$\sum_{k=1}^{n} a_k b_k$$


is a single operation. In many cases it is convenient to transform 
the given expressions to a special artificial form (say, replace
division by multiplication by the reciprocal, or bring an expression
to a form convenient for taking logarithms, and so forth). 
Secondly, an important step is that of checking the computations. 
A computation cannot be considered complete without a check. 
This stage is divided into intermediate checks and the final check.
In intermediate checking, we perform certain supplementary ope- 
rations that serve to convince us that the work up to a certain 
point has, to a certain degree of assurance, been performed cor- 
rectly, otherwise the computations of that stage are repeated. The 
final check has to do only with the final result. For example, if 
the root of an equation has been computed, the value found may 
be substituted into the equation and we can see whether the solu- 
tion is correct or not. Common sense tells us that if a computa- 
tion is extensive, it is not wise to confine oneself to a final check, 
thus risking an enormous amount of computational effort. In such 
cases, it is advisable to check in stages. In critical cases, the
computations are checked by having the entire job performed by
two separate computors, or the problem is done by one computor
in two different ways.
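
Both kinds of final check can be illustrated with a short Python sketch. The sample equation x^3 - x - 1 = 0, the helper name `bisect_root` and the tolerances are assumptions chosen for the illustration, not taken from the text. The root is first found by halving, then checked by substitution, and finally recomputed in a different way by simple iteration.

```python
def bisect_root(f, lo, hi, tol=1e-10):
    """Halve [lo, hi] until it is shorter than tol; f(lo) and f(hi) must differ in sign."""
    flo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        fmid = f(mid)
        if (flo < 0) == (fmid < 0):
            lo, flo = mid, fmid    # the root lies in the upper half
        else:
            hi = mid               # the root lies in the lower half
    return (lo + hi) / 2

f = lambda x: x**3 - x - 1         # hypothetical sample equation f(x) = 0
root = bisect_root(f, 1.0, 2.0)

# Final check: substitute the computed root back into the equation.
residual = f(root)
assert abs(residual) < 1e-8

# Second computation in a different way: iterate x = (x + 1)^(1/3).
x = 1.5
for _ in range(100):
    x = (x + 1) ** (1 / 3)
assert abs(x - root) < 1e-8
```

A small residual alone does not bound the error of the root in general; agreement between two independent computations is the stronger check.
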
A third important point to bring up here is the problem of
estimating accuracy. In most cases, computations are carried out
with approximate numbers and in approximate fashion. Therefore,
even if the method of solution is exact, there will be errors of
operation and rounding errors at every step in the computations.
If the method is approximate, there will be added an error of
method. Given unfavourable circumstances, the overall error may
be so large that the result obtained will be purely illusory.
Appropriate chapters of the book give methods for estimating errors
of basic computations.

In the computation sheets, it is useful to provide columns for
tabular differences (see Sec. 14.2) which may be used as a check. 
It is done this way. If the correctness of the difference table is 
violated in some section, the corresponding table entries should be 
recalculated (or the cause should be sought out). 
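
A check by differences might look as follows in Python. The table of y = x^3 and the deliberately miscopied entry are hypothetical; the sketch relies on the known fact that the fourth differences of a cubic tabulated at equal steps vanish, so a single wrong entry e shows up in them as the pattern e, -4e, 6e, -4e, e.

```python
def differences(values, order):
    """Return the finite differences of the given order of a list of values."""
    d = list(values)
    for _ in range(order):
        d = [b - a for a, b in zip(d, d[1:])]
    return d

# Hypothetical table: y = x^3 at x = 0, 1, ..., 9, with one entry miscopied.
table = [x**3 for x in range(10)]
table[5] += 2                      # deliberate blunder: 127 entered instead of 125

fourth = differences(table, 4)     # all zeros for a correct table of a cubic

# The peak 6e of the error pattern sits at index k - 2 of the fourth
# differences when entry k of the table is wrong.
peak = max(range(len(fourth)), key=lambda i: abs(fourth[i]))
suspect = peak + 2                 # index of the entry to recalculate
```

Here the irregular run in the fourth differences points to the sixth entry of the table, which is the one to recalculate.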

One final word on the importance of making neat and legible 
entries in the computation sheets. Experience shows that illegible 
writing leads to blunders that can nullify a well organized com- 
putation. Particularly dangerous are mistakes made when writing 
down numbers with many zeros. Such numbers should be written 
in powers-of-ten notation (scientific notation, as it is sometimes 
called). For instance 


$$0.00000345 = 3.45 \cdot 10^{-6}$$


and so on. 
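
In a program the same precaution is available through exponent formatting. A minimal Python sketch, with the variable names chosen only for illustration:

```python
value = 0.00000345
text = f"{value:.2e}"      # mantissa with two digits after the point
restored = float(text)     # reading the short form back gives the same number
```

The string `text` holds the number in powers-of-ten form, which is much harder to miscopy than a long run of zeros.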

The rest of this book is devoted mainly to methods of compu- 
tation. The numerical examples are in many cases simplified and 
intermediate operations are frequently omitted. 


Chapter 1 
APPROXIMATE NUMBERS 


1.1 ABSOLUTE AND RELATIVE ERRORS 


An approximate number $a$ is a number that differs but slightly
from an exact number $A$ and is used in place of the latter in
calculations. If it is known that $a < A$, then $a$ is called a minor
(too small) approximation of $A$; if $a > A$, then $a$ is a major (too
large) approximation of $A$. For example, for $\sqrt{2}$ the number 1.41
is a minor approximation while 1.42 is a major approximation,
since $1.41 < \sqrt{2} < 1.42$. If $a$ is an approximate value of the
number $A$, we write $a \approx A$.

By the error $\Delta a$ of an approximate number $a$ we ordinarily
mean the difference between the exact number $A$ and the given
approximate number, that is,

$$\Delta a = A - a$$


(sometimes the difference $a - A$ is called the error). If $A > a$, then
the error is positive: $\Delta a > 0$; however, if $A < a$, then the error is
negative: $\Delta a < 0$. To obtain the exact number $A$, add the error
$\Delta a$ to the approximate number $a$:


$$A = a + \Delta a$$


Thus, an exact number may be regarded as an approximate number
with error zero.

In many cases the sign of the error is not known. It is then
advisable to use the absolute error of the approximate number:

    $\Delta = |\Delta a|$

Definition 1. The absolute error $\Delta$ of an approximate number a
is the absolute value of the difference between the corresponding
exact number A and the number a:

    $\Delta = |A - a|$    (1)

Here two cases are to be distinguished: 


(1) the number A is known, and then the absolute error $\Delta$ is
readily determined from formula (1);

(2) the number A is not known, which is most often the case,
and hence the absolute error $\Delta$ cannot be found from formula (1).

It is then useful, in place of the unknown theoretical absolute
error $\Delta$, to introduce its upper estimate, the so-called limiting
absolute error.

Definition 2. The limiting absolute error of an approximate num- 
ber is any number not less than the absolute error of that number. 

Thus, if $\Delta_a$ is the limiting absolute error of an approximate
number a which takes the place of the exact number A, then

    $\Delta = |A - a| \le \Delta_a$    (2)

From this it follows that the exact number A lies within the range

    $a - \Delta_a \le A \le a + \Delta_a$    (3)

Hence $a - \Delta_a$ is a minor approximation to A, and $a + \Delta_a$ is a major
approximation to A.
For brevity, we can then write

    $A = a \pm \Delta_a$


Example 1. Determine the limiting absolute error of the number
a = 3.14 which is used instead of the number $\pi$.

Solution. Since we have the inequality $3.14 < \pi < 3.15$ it follows
that $|a - \pi| < 0.01$ and, hence, we can take $\Delta_a = 0.01$.
Taking note of the fact that

    $3.14 < \pi < 3.142$

we have a better estimate: $\Delta_a = 0.002$.
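
The reasoning of Example 1 can be sketched in a few lines of Python. This is only an illustration: the helper name `limiting_abs_error` is hypothetical, and the true error is printed solely because $\pi$ happens to be available to the machine.

```python
import math

def limiting_abs_error(a, lower, upper):
    """Bound on |A - a| when all we know is lower <= A <= upper."""
    return max(abs(a - lower), abs(upper - a))

a = 3.14
coarse = limiting_abs_error(a, 3.14, 3.15)    # from 3.14 < pi < 3.15
fine = limiting_abs_error(a, 3.14, 3.142)     # from 3.14 < pi < 3.142
true_error = abs(math.pi - a)                 # known here only because pi is known

print(round(coarse, 6), round(fine, 6), round(true_error, 6))
```

Both bounds are legitimate limiting absolute errors in the sense of Definition 2; the narrower bracket simply gives the more useful one.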

Note that the concept of a limiting absolute error as formulated
above is very broad, namely, the limiting absolute error of an
approximate number a is to be understood as any one of an infinity
of nonnegative numbers $\Delta_a$ that satisfy inequality (2). It follows
logically therefrom that any number exceeding the limiting absolute
error of a given approximate number can also be called the
limiting absolute error of this number. For practical purposes it
is convenient to take for $\Delta_a$ the smallest number (under the given
circumstances) that satisfies inequality (2).

When writing an approximate number obtained from a measurement,
it is common to give its limiting absolute error. For example,
if the length of a line segment is l = 214 cm to within 0.5 cm,
then we write l = 214 cm $\pm$ 0.5 cm. Here the limiting absolute
error is $\Delta_l = 0.5$ cm, and the exact magnitude of the length l of the
segment falls within the range 213.5 cm $\le l \le$ 214.5 cm.

The absolute error (or the limiting absolute error) does not suffice
to describe the accuracy of a measurement or a computation.
Suppose that in measuring the lengths of two rods we get
$l_1 = 100.8$ cm $\pm$ 0.1 cm and $l_2 = 5.2$ cm $\pm$ 0.1 cm. Despite the fact
that the limiting absolute errors coincide, the first measurement
is better than the second one. An essential point in the accuracy
of measurements is the absolute error related to unit length. It is
called the relative error.


Definition 3. The relative error $\delta$ of an approximate number a
is the ratio of the absolute error $\Delta$ of the number to the modulus
(absolute value) of the corresponding exact number A ($A \ne 0$).

Thus

    $\delta = \frac{\Delta}{|A|}$    (4)

Whence $\Delta = |A|\delta$.
As in the case of absolute errors, we introduce the notion of a
limiting relative error.
Definition 4. The limiting relative error $\delta_a$ of a given approximate
number a is any number not less than the relative error of
that number. By definition we have

    $\delta \le \delta_a$    (5)

That is, $\frac{\Delta}{|A|} \le \delta_a$, whence $\Delta \le |A|\delta_a$.


Thus, for the limiting absolute error of a number a we can take

    $\Delta_a = |A|\delta_a$    (6)

Since, in practical situations, $A \approx a$, in place of formula (6)
one frequently uses

    $\Delta_a = |a|\delta_a$    (6')

From this formula, knowing the limiting relative error $\delta_a$, we
obtain the limits for the exact number. The fact that the exact
number lies between $a(1 - \delta_a)$ and $a(1 + \delta_a)$ is symbolized as

    $A = a(1 \pm \delta_a)$
Let a be an approximate number taking the place of an exact
number A, and let $\Delta_a$ be the limiting absolute error of a. For
definiteness put $A > 0$, $a > 0$ and $\Delta_a < a$. Then

    $\delta = \frac{\Delta}{A} \le \frac{\Delta_a}{a - \Delta_a}$

We can thus take the number

    $\delta_a = \frac{\Delta_a}{a - \Delta_a}$

for the limiting relative error of the number a.


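Similarly we get $\Delta = A\delta \le (a + \Delta)\delta_a$, whence
$\Delta \le \frac{a\delta_a}{1 - \delta_a}$, and so we can take

    $\Delta_a = \frac{a\delta_a}{1 - \delta_a}$

If, as commonly occurs, $\Delta_a \ll a$ and $\delta_a \ll 1$ (the symbol $\ll$ means
"very much less than"), then we can take it approximately
that

    $\delta_a \approx \frac{\Delta_a}{a}$

and

    $\Delta_a \approx a\delta_a$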


Example 2. The weight of 1 dm$^3$ of water at 0°C is given as
p = 999.847 gf $\pm$ 0.001 gf (gf = gram (force)). Determine the limiting
relative error of the result of weighing the water.

Solution. We obviously have $\Delta_p = 0.001$ gf and $p \ge 999.846$ gf.
Hence

    $\delta_p = \frac{0.001}{999.846} \approx 10^{-6} = 10^{-4}\,\%$

Example 3. A result of R = 29.25 was obtained in determining
the gas constant for air. Knowing that the relative error of this
value is 1‰ (one part per thousand), find the limits within which
R lies.

Solution. We have $\delta_R = 0.001$, and so $\Delta_R = R\delta_R \approx 0.03$.
Hence $29.22 \le R \le 29.28$.
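
Examples 2 and 3 can be checked numerically. The sketch below assumes positive quantities; the function names are hypothetical and are introduced only for illustration.

```python
def limiting_relative_error(a, delta_a):
    """delta_a / (a - delta_a), valid when a > delta_a > 0."""
    return delta_a / (a - delta_a)

def limits_from_relative_error(a, rel):
    """Interval a(1 - rel) .. a(1 + rel), as in formula (6')."""
    d = abs(a) * rel
    return a - d, a + d

# Example 2: p = 999.847 gf with limiting absolute error 0.001 gf
rel_p = limiting_relative_error(999.847, 0.001)

# Example 3: R = 29.25 with a relative error of one part per thousand
low, high = limits_from_relative_error(29.25, 0.001)

print(rel_p, low, high)
```

The interval printed for R is slightly narrower than 29.22 to 29.28 because the text rounds the absolute error up to 0.03 before forming the limits.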


1.2 BASIC SOURCES OF ERRORS 


The errors one encounters in mathematical problems may, in the 
main, be broken down into five groups. 

1. Errors involved in the statement of the problem. Mathema- 
tical statements rarely give an exact picture of actual phenomena. 
For the most part they are only idealized models. In studying the 
phenomena of nature we are forced, as a rule, to accept certain 
conditions that simplify the problem at hand. This is a source of 
errors (errors of the problem). 

It sometimes happens that it is either difficult or even impos- 
sible to solve a given problem when formulated precisely. If that 
is the case, it is replaced by an approximate problem yielding 
almost the same results. This is the source of an error termed the 
error of method. 

2. Errors stemming from the presence of infinite processes in
mathematical analysis. The functions involved in mathematical
formulas are frequently specified in the form of infinite sequences
or series (for example, $\sin x = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \dots$). What is
more, many mathematical equations can be solved only by describing
infinite processes whose limits are the desired solutions.
Since, generally speaking, an infinite process cannot be completed
in a finite number of steps, we are forced to stop at some term of
the sequence and consider it to be an approximation to the required
solution. Naturally, such a termination of the process gives
rise to an error. This error is called the residual error.

3. Errors due to numerical parameters (in formulas) whose va- 
lues can only be determined approximately. Such, for instance, 
are all physical constants. Let us agree to call this error the 
initial error. 

4. Errors associated with the system of numeration. When depicting
even rational numbers in the decimal system or some other positional
system, there may be an infinity of digits to the right of the decimal
point (generally, radix point). For instance, we may have a nonterminating
repeating decimal. It is obvious that we can only use a finite
number of digits in our computations. This is the source of the so-called
rounding errors. For example, assuming $\frac{1}{3} = 0.333$, we get an error
of $\Delta \approx 3 \cdot 10^{-4}$. One also has to round off finite multidigit numbers.

5. Errors due to operations involving approximate numbers (errors
of operation). When performing computations with approximate numbers,
we naturally carry (to some extent) the errors of the original
data into the final result. In this respect, errors of operation are
inherent.

Quite naturally, in a specific problem some errors are absent and
others exert a negligible effect. But, generally, a complete analysis
must include all types of errors. In what follows we will confine
ourselves largely to computing errors of operation and errors of
method.
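
Two of the error sources listed above, the residual error of a truncated series and the rounding error of a truncated decimal, can be made concrete with a short Python sketch. The choice of x = 0.5 and of two series terms is arbitrary and only illustrative.

```python
import math
from fractions import Fraction

def sin_partial(x, terms):
    """Sum of the first `terms` terms of x - x^3/3! + x^5/5! - ..."""
    return sum((-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
               for k in range(terms))

x = 0.5
# Residual error: the series is stopped after two terms.
residual = abs(math.sin(x) - sin_partial(x, 2))
# For this alternating series the residual is below the first omitted term.
first_omitted = x ** 5 / math.factorial(5)

# Rounding error: 1/3 replaced by 0.333, computed exactly with fractions.
rounding = Fraction(1, 3) - Fraction(333, 1000)

print(residual, first_omitted, float(rounding))
```

The exact rounding error is 1/3000, which agrees with the estimate of about $3 \cdot 10^{-4}$ quoted in item 4.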


1.3 SCIENTIFIC NOTATION. SIGNIFICANT DIGITS. 
THE NUMBER OF CORRECT DIGITS 


Any positive number a, it will be recalled, can be represented
as a terminating or nonterminating decimal:

    $a = \alpha_m 10^m + \alpha_{m-1} 10^{m-1} + \alpha_{m-2} 10^{m-2} + \dots + \alpha_{m-n+1} 10^{m-n+1} + \dots$    (1)

where $\alpha_i$ are the digits of the number a ($\alpha_i = 0, 1, 2, \dots, 9$),
the leading digit $\alpha_m \ne 0$ and m is an integer (the highest power
of ten in the number a). For example,

    $3141.59\ldots = 3 \cdot 10^3 + 1 \cdot 10^2 + 4 \cdot 10^1 + 1 \cdot 10^0 + 5 \cdot 10^{-1} + 9 \cdot 10^{-2} + \dots$


Each unit occupies a specific position in the number a written
as the decimal fraction (1) and has a definite value. The unit standing
in the first position is equal to $10^m$, that in the second position,
$10^{m-1}$, in the nth position, $10^{m-n+1}$, etc.

Actual cases usually involve approximate numbers in the form
of terminating decimals:

    $b = \beta_m 10^m + \beta_{m-1} 10^{m-1} + \dots + \beta_{m-n+1} 10^{m-n+1}$    $(\beta_m \ne 0)$    (2)

All retained decimal digits $\beta_i$ ($i = m, m-1, \dots, m-n+1$) are
called significant digits of the approximate number b; note that
some of them may be equal to zero (with the exception of $\beta_m$).
In the decimal positional system of representing the number b, one
often has to introduce zeros at the beginning or the end of the
number. To illustrate,

    $b = 7 \cdot 10^{-3} + 0 \cdot 10^{-4} + 1 \cdot 10^{-5} + 0 \cdot 10^{-6} = 0.007010$

or

    $b = 2 \cdot 10^9 + 0 \cdot 10^8 + 0 \cdot 10^7 + 3 \cdot 10^6 + 0 \cdot 10^5 = 2{,}003{,}000{,}000$

The first three zeros of 0.007010 and the last five zeros of
2,003,000,000 are not significant digits.


Definition 1. A significant digit of an approximate number is any
nonzero digit in its decimal representation, or any zero lying between
significant digits or used as a placeholder to indicate a retained
place. All other zeros of the approximate number that serve
only to fix the position of the decimal point are not to be considered
significant digits.

For example, in the number 0.002080 the first three zeros are
not significant digits since they serve only to fix the position of
the decimal point and indicate the place values of the other digits.
The other two zeros are significant digits since the first lies between
the digits 2 and 8 and the second (as indicated by the notation)
shows that we retain the decimal place $10^{-6}$ in the approximate
number. If the last digit of 0.002080 is not significant, then the
number must be written as 0.00208. From this point of view, the
numbers 0.002080 and 0.00208 are not the same, because the former
has four significant digits and the latter only three.

When writing large numbers, the zeros on the right can serve
both to indicate the significant digits and to fix the place values
of the other digits. This can lead to misunderstanding when the
numbers are written in the ordinary way. Consider, for instance,
the number 689,000. It is not clear how many significant digits
there are, although we can say we have at least three. This ambiguity
can be avoided by using powers-of-ten notation (scientific notation)
and writing the number as $6.89 \cdot 10^5$ if it has three significant
digits, or as $6.8900 \cdot 10^5$ if the number has five significant digits,
etc. Generally speaking, this notation is convenient for numbers
containing a large number of nonsignificant zeros, such as
$0.000000120 = 1.20 \cdot 10^{-7}$, and the like.
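
In a program the same convention can be imitated with exponent formatting. The following is a minimal Python sketch; the helper `sci` is a hypothetical name, and the output uses the machine form `e+05` in place of $\cdot 10^5$.

```python
def sci(value, sig_digits):
    """Format value in powers-of-ten notation with sig_digits significant digits."""
    return f"{value:.{sig_digits - 1}e}"

three = sci(689000, 3)        # three significant digits
five = sci(689000, 5)         # five significant digits, trailing zeros kept
small = sci(0.000000120, 3)   # nonsignificant leading zeros absorbed by the exponent

print(three, five, small)
```

Fixing the number of digits in the mantissa makes the count of significant digits explicit, which the plain form 689,000 cannot do.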

Let us introduce the notion of correct digits of an approximate
number.

Definition 2. We say that the first n significant digits of an
approximate number are correct if the absolute error of the number
does not exceed one half unit in the nth place, counting from
left to right.

Thus, if for an approximate number a given by (1), which takes the
place of an exact number A, it is known that

    $\Delta = |A - a| \le \frac{1}{2} \cdot 10^{m-n+1}$

then by definition the first n digits $\alpha_m, \alpha_{m-1}, \dots, \alpha_{m-n+1}$ of
this number are correct.

For example, with respect to the exact number A = 35.97, the
number a = 36.00 is an approximation correct to three digits, since

    $|A - a| = 0.03 < \frac{1}{2} \cdot 0.1$


Note that in mathematical tables all indicated significant digits
are correct. For instance, in a five-place table of logarithms the
absolute error of a mantissa definitely does not exceed $\frac{1}{2} \cdot 10^{-5}$, etc.


The term “n correct digits” is not to be taken literally; that
is, it is not necessarily true that in a given approximate number
a having n correct digits, the first n significant digits of it coincide
with the corresponding digits of the exact number A. For example,
the approximate number a = 9.995, which stands for the exact
number A = 10, is correct to three digits, yet all the digits of
these numbers differ. However, in many cases we do find that the
correct digits of the approximate number are the same as the
corresponding digits of the exact number.


Note. In some cases it is convenient to say that the number a
is an approximation to an exact number A to n correct digits in
the broad sense, meaning by this that the absolute error $\Delta = |A - a|$
does not exceed one unit in the nth significant digit of the approximate
number.

For example, with respect to the exact number A = 412.3567,
the number a = 412.356 is an approximation correct to six digits
in the broad sense, since $\Delta = 0.0007 < 1 \cdot 10^{-3}$.

In the sequel we will regard the correct digits of an approximate
number in the sense of Definition 2 (that is to say, in the
narrow sense), unless otherwise stated.
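
Definition 2 and the Note translate directly into a counting procedure. The sketch below is one possible reading, not a prescribed algorithm: the name `correct_digits` is hypothetical, the numbers are passed as decimal strings to avoid binary rounding, and the function returns the largest n for which the defining inequality holds.

```python
from decimal import Decimal

def correct_digits(a, A, broad=False):
    """Largest n with |A - a| <= c * 10**(m - n + 1), where m is the
    exponent of the leading digit of a, c = 1/2 (narrow) or 1 (broad)."""
    a, A = Decimal(a), Decimal(A)
    err = abs(A - a)
    if err == 0:
        raise ValueError("a is exact; the count is unbounded")
    m = a.adjusted()                      # position of the leading digit
    c = Decimal(1) if broad else Decimal("0.5")
    n = 0
    # 10**(m - n) is one unit of the (n + 1)th significant place
    while err <= c * Decimal(10) ** (m - n):
        n += 1
    return n

print(correct_digits("36.00", "35.97"))                   # narrow sense
print(correct_digits("9.995", "10"))                      # narrow sense
print(correct_digits("412.356", "412.3567", broad=True))  # broad sense
```

The three calls reproduce the counts quoted in the text (three, three and six digits).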




1.4 ROUNDING OF NUMBERS 


Consider an approximate or exact number a written in the decimal
number system. It is often required to round off this number,
which is to say, to replace it by a number $a_1$ having a smaller
number of significant digits. The number $a_1$ is chosen so as to
keep the rounding error $|a_1 - a|$ to a minimum.

Rounding-off rule. In order to round off a number to n signifi- 
cant digits, drop all digits to the right of the nth significant di- 
git, or replace them by zeros if the zeros are needed as placehol- 
ders. In this operation, note the following: 

(1) if the first of the discarded digits is less than 5, leave the
remaining digits unchanged;

(2) if the first discarded digit exceeds 5, add 1 to the last re- 
tained digit; 

(3) if the first discarded digit is exactly 5 and there are non- 
zero digits among those discarded, add unity to the last retained 
digit; 

(3a) however, if the first discarded digit is exactly 5 and all 
other discarded digits are zeros, the last retained digit is left un- 
changed if even and is increased by unity if odd (the even- 
digit rule). 

In other words, if in rounding off a number we discard less 
than half a unit of the last retained digit, all the retained digits 
are left unaltered; but if the discarded portion of the number is 
more than one half unit of the last retained place, the digit of 
that place is increased by unity. In the exceptional case when 
the discarded part is exactly equal to one-half unit of the last 
retained place, the even-digit rule is invoked in order to compen- 
sate for the signs of errors due to rounding. 

It is obvious that when applying the rounding-off rule, the
rounding error does not exceed one half unit in the place of the
last retained significant digit.


Example 1. Rounding the number

    $\pi = 3.1415926535\ldots$

to five, four and three significant digits, we get the approximate
numbers 3.1416, 3.142, 3.14 with absolute errors less than
$\frac{1}{2} \cdot 10^{-4}$, $\frac{1}{2} \cdot 10^{-3}$ and $\frac{1}{2} \cdot 10^{-2}$.

Example 2. Rounding the number 1.2500 to two significant digits,
we obtain the approximate number 1.2 with an absolute error
equal to $\frac{1}{2} \cdot 10^{-1} = 0.05$.
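
The rounding-off rule, including the even-digit rule (3a), can be sketched with Python's `decimal` module, whose `ROUND_HALF_EVEN` mode behaves the same way on exact ties. The function name `round_sig` is hypothetical, and numbers are passed as strings so that the decimal digits are taken exactly as written.

```python
from decimal import Decimal, ROUND_HALF_EVEN

def round_sig(value, n):
    """Round a decimal string to n significant digits; ties go to the even digit."""
    d = Decimal(value)
    if d == 0:
        return d
    exponent = d.adjusted() - n + 1            # place of the nth significant digit
    unit = Decimal(1).scaleb(exponent)         # one unit of that place
    return d.quantize(unit, rounding=ROUND_HALF_EVEN)

for n in (5, 4, 3):
    print(round_sig("3.1415926535", n))        # Example 1
print(round_sig("1.2500", 2))                  # Example 2: tie, 2 is even
print(round_sig("1.3500", 2))                  # tie, 3 is odd
print(round_sig("1.2501", 2))                  # not a tie: rule (3) applies
```

Binary floating-point rounding would not be a faithful model here, since a value such as 1.35 is not stored exactly and the tie would not be detected.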




The accuracy of an approximate number does not depend on
the number of significant digits, but on the number of correct
significant digits [1], [2]. When an approximate number contains
extra incorrect significant digits, one resorts to rounding. The
following practical rule may be used as a guide: when performing
approximate computations, the number of significant digits in the
intermediate results must not exceed the number of correct digits by
more than one or two units. The final result must not contain
more than one extra significant digit over the number of correct
digits. If the absolute error of the result does not exceed two
units of the last retained place, the extra digit is in doubt.

The foregoing rule makes for a large saving in time (by dis- 
pensing with extra digits) without impairing the accuracy of the 
computation. Retention of additional digits has the meaning that 
ordinarily an error estimate of the results is made relative to the 
worst version, and the actual error may be appreciably less than 
the maximum theoretical error. Thus, in many cases, significant 
digits that are considered incorrect are actually correct. 

One also rounds off exact numbers that contain either too many
or an infinity of significant digits, depending on the general required
accuracy of the computations.

Note that if an exact number A is rounded off to n significant
digits by the rounding-off rule, the resulting approximate number
a will have n correct digits (in the narrow sense).

Now if an approximate number a having n correct digits is
rounded off to n significant digits, the resulting new approximate
number $a_1$ will, generally speaking, have n correct digits in the
broad sense. Indeed, by virtue of the inequality

    $|A - a_1| \le |A - a| + |a - a_1|$

the limiting absolute error of the number $a_1$ is made up of the
absolute error of a and the rounding error.


1.5 RELATIONSHIP BETWEEN THE RELATIVE ERROR 
OF AN APPROXIMATE NUMBER AND THE NUMBER 
OF CORRECT DIGITS 


We will now prove a theorem that relates the magnitude of the 
relative error of an approximate number to the number of correct 
digits in that number [3], [4]. 


Theorem. If a positive approximate number a has n correct digits
in the narrow sense, the relative error $\delta$ of this number does not
exceed $\left(\frac{1}{10}\right)^{n-1}$ divided by the first significant digit of the given
number, or

    $\delta \le \frac{1}{\alpha_m}\left(\frac{1}{10}\right)^{n-1}$

where $\alpha_m$ is the first significant digit of number a.
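Proof. Let the number

    $a = \alpha_m 10^m + \alpha_{m-1} 10^{m-1} + \dots + \alpha_{m-n+1} 10^{m-n+1} + \dots$    $(\alpha_m \ge 1)$

be an approximate value of the exact number A and let it be
correct to n digits. By definition we then have

    $\Delta = |A - a| \le \frac{1}{2} \cdot 10^{m-n+1}$

whence

    $A \ge a - \frac{1}{2} \cdot 10^{m-n+1}$

This inequality is further strengthened if the number a is replaced
by a definitely smaller number $\alpha_m 10^m$:

    $A \ge \alpha_m 10^m - \frac{1}{2} \cdot 10^{m-n+1} = \frac{1}{2} \cdot 10^m \left(2\alpha_m - \frac{1}{10^{n-1}}\right)$    (1)

The right side of inequality (1) is a minimum for n = 1. Therefore

    $A \ge \frac{1}{2} \cdot 10^m (2\alpha_m - 1)$    (2)

or, since

    $2\alpha_m - 1 = \alpha_m + (\alpha_m - 1) \ge \alpha_m$

it follows that

    $A \ge \frac{1}{2} \alpha_m 10^m$

Hence

    $\delta = \frac{\Delta}{A} \le \frac{\frac{1}{2} \cdot 10^{m-n+1}}{\frac{1}{2} \alpha_m 10^m} = \frac{1}{\alpha_m}\left(\frac{1}{10}\right)^{n-1}$

Thus

    $\delta \le \frac{1}{\alpha_m}\left(\frac{1}{10}\right)^{n-1}$    (3)

and the theorem is proved.

Note 1. Inequality (2) may be used to obtain a more exact
estimate of the relative error $\delta$.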


Corollary 1. For the limiting relative error of the number a we
can take

    $\delta_a = \frac{1}{\alpha_m}\left(\frac{1}{10}\right)^{n-1}$    (4)

where $\alpha_m$ is the first significant digit of the number a.


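Corollary 2. If the number a has at least two correct digits,
that is, $n \ge 2$, then for all practical purposes the following formula
holds:

    $\delta_a = \frac{1}{2\alpha_m}\left(\frac{1}{10}\right)^{n-1}$    (5)

Indeed, for $n \ge 2$ we can neglect $\frac{1}{10^{n-1}}$ in inequality (1). Then

    $A \ge \frac{1}{2} \cdot 10^m \cdot 2\alpha_m = \alpha_m 10^m$

whence

    $\delta = \frac{\Delta}{A} \le \frac{\frac{1}{2} \cdot 10^{m-n+1}}{\alpha_m 10^m} = \frac{1}{2\alpha_m}\left(\frac{1}{10}\right)^{n-1}$

Consequently

    $\delta_a = \frac{1}{2\alpha_m}\left(\frac{1}{10}\right)^{n-1}$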

Note 2. If the approximate number a is correct to n digits in 
the broad sense, (4) and (5) should be increased by a factor of 2. 


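Example 1. What is the limiting relative error if we take a = 3.14
in place of the number $\pi$?

Solution. In our case $\alpha_m = 3$ and n = 3, and so, by formula (5),

    $\delta_a = \frac{1}{2 \cdot 3}\left(\frac{1}{10}\right)^{3-1} = \frac{1}{600} = \frac{1}{6}\,\%$

Example 2. How many digits are to be taken in computing $\sqrt{20}$
so that the error does not exceed 0.1%?

Solution. Since the first digit is 4, we have $\alpha_m = 4$ and $\delta = 0.001$.
We have $\frac{1}{4 \cdot 10^{n-1}} \le 0.001$, whence $10^{n-1} \ge 250$ and $n \ge 4$.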
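
Formulas (4) and (5) and the digit count of Example 2 lend themselves to a brief Python sketch. The function names and the `practical` flag are hypothetical conveniences, not notation from the text.

```python
def limiting_rel_error(first_digit, n, practical=False):
    """Formula (4): (1/alpha_m) * (1/10)**(n - 1).
    With practical=True the bound is halved, as in formula (5)."""
    bound = (1 / first_digit) * 10.0 ** (-(n - 1))
    return bound / 2 if practical else bound

def digits_needed(first_digit, rel_tol):
    """Smallest n for which formula (4) gives a bound not exceeding rel_tol."""
    n = 1
    while limiting_rel_error(first_digit, n) > rel_tol:
        n += 1
    return n

# a = 3.14 taken for pi: alpha_m = 3, n = 3
print(limiting_rel_error(3, 3), limiting_rel_error(3, 3, practical=True))

# sqrt(20) = 4.47...: alpha_m = 4, relative error at most 0.1%
print(digits_needed(4, 0.001))
```

For the square root of 20 the loop stops at n = 4, in agreement with the inequality $10^{n-1} \ge 250$.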


This theorem enables us to determine the relative error $\delta$ of an
approximate number

    $a = \alpha_m 10^m + \alpha_{m-1} 10^{m-1} + \dots$    (6)

from the number of its correct digits.

To solve the inverse problem, that is, to determine the number n of
correct digits of number (6) if the relative error $\delta$ is known, we
ordinarily use the approximate formula

    $\delta = \frac{\Delta}{a}$    $(a > 0)$





where $\Delta$ is the absolute error of a. Whence

    $\Delta = a\delta$    (7)

Taking into account the leading power of ten in the number $\Delta$,
it is easy to establish the number of correct digits in the given
approximate number a. In particular, if

    $\delta \le \frac{1}{10^n}$

then from formulas (6) and (7) we have

    $\Delta \le (\alpha_m + 1) 10^m \cdot 10^{-n} \le 10^{m-n+1}$

In other words, a is definitely correct to n decimal places in the
broad sense. Similarly, if

    $\delta \le \frac{1}{2 \cdot 10^n}$

then the number a is correct to n places in the narrow sense.


Example 3. An approximate number a = 24,253 has a relative error
of 1%. How many correct digits has it?

Solution. We have

    $\Delta = 24{,}253 \cdot 0.01 \approx 243 = 2.43 \cdot 10^2$

Thus, the number a has only two correct digits (n = 2); the hundreds
digit is in doubt. According to the rule given above, the number
a is preferably written as $a = 2.43 \cdot 10^4$.
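
The approximate count used in Example 3 can be sketched as follows. It is an illustration of formula (7) only, under the assumption a > 0; the function name is hypothetical, and the more careful inequalities of the Note that follows are not applied.

```python
from decimal import Decimal

def digits_from_relative_error(a, rel, broad=True):
    """Estimate Delta = a * rel (formula (7)) and count the places whose
    unit (broad sense) or half unit (narrow sense) is not below Delta."""
    a, rel = Decimal(a), Decimal(rel)
    delta = abs(a) * rel
    if delta == 0:
        raise ValueError("zero error: the count is unbounded")
    m = a.adjusted()                     # exponent of the leading digit of a
    c = Decimal(1) if broad else Decimal("0.5")
    n = 0
    while delta <= c * Decimal(10) ** (m - n):
        n += 1
    return n, delta

n, delta = digits_from_relative_error("24253", "0.01")
print(n, delta)
```

For a = 24,253 and a relative error of 1% the estimate is 242.53, and the count is two digits in both the broad and the narrow sense.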


Note. The foregoing method for determining the number of correct
digits is approximate. In an exact count of the correct digits of
number a, one should proceed from the inequalities

    $\delta \ge \frac{\Delta}{a + \Delta}$

and

    $\Delta \le \frac{a\delta}{1 - \delta}$    $(0 < \delta < 1)$





1.6 TABLES FOR DETERMINING THE LIMITING RELATIVE 
ERROR FROM THE NUMBER OF CORRECT DIGITS 
AND VICE VERSA 


If an approximate number is written with indicated correct
digits, then it is easy to compute its limiting relative error. This
is a frequent requirement in practical computations and so it is
desirable to simplify the operation. Table 2 [5] indicates the relative
error as a percentage of the approximate number depending
on the number of correct digits (in the broad sense) and on the
first two significant digits of the number, counting from left to right.


TABLE 2
RELATIVE ERROR (IN %) OF NUMBERS CORRECT TO n DIGITS

First two                       n
significant digits     2        3        4

10-11                  10       1        0.1
12-13                  8.3      0.83     0.083
14, ..., 16            7.1      0.71     0.071
17, ..., 19            5.9      0.59     0.059
20, ..., 22            5        0.5      0.05
23, ..., 25            4.3      0.43     0.043
26, ..., 29            3.8      0.38     0.038
30, ..., 34            3.3      0.33     0.033
35, ..., 39            2.9      0.29     0.029
40, ..., 44            2.5      0.25     0.025
45, ..., 49            2.2      0.22     0.022
50, ..., 59            2        0.2      0.02
60, ..., 69            1.7      0.17     0.017
70, ..., 79            1.4      0.14     0.014
80, ..., 89            1.2      0.12     0.012
90, ..., 99            1.1      0.11     0.011


By way of an example, suppose we have an approximate number
0.00354 correct to three decimals. Since n = 3 and the number 35
lies in the interval 35, ..., 39, from Table 2 we find $\delta = 0.29\%$.

If only the first digit of the number is known, say 4, then of
course we take the greater of the numbers 2.5 and 2.2, which
correspond to possible versions 40, ..., 44 and 45, ..., 49 (for
n = 2). If the first digit is unknown, then we take the numbers
from the first row (10%, 1%, and 0.1%) as the largest ones. From
this table we see that three correct digits ensure a relative accuracy
(with an error not exceeding 1%) sufficient for most computations.
Note that if the approximate number is correct to two,
three or four digits in the narrow sense, then all the numbers of
the table must be halved.
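
The entries of Table 2 appear to follow from dividing one unit of the nth place by the smallest number in each row. That reading is an inference from the tabulated values, not a statement of the source, and the Python sketch below (with the hypothetical name `table2_entry`) should be taken in that spirit.

```python
def table2_entry(first_two, n):
    """Relative error in percent: one unit of the nth significant place
    divided by a number whose first two digits form the integer first_two."""
    return 100.0 * 10.0 ** (2 - n) / first_two

# Lower ends of the rows of Table 2 (assumed to generate the entries).
row_starts = [10, 12, 14, 17, 20, 23, 26, 30, 35, 40, 45, 50, 60, 70, 80, 90]
for start in row_starts:
    print(start, [round(table2_entry(start, n), 4) for n in (2, 3, 4)])

# The worked example: 0.00354, n = 3, first two digits 35
print(round(table2_entry(35, 3), 2))
```

Rounded to two significant digits, the printed values agree with the table, and halving them gives the narrow-sense figures mentioned above.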

Table 3 [5] gives upper bounds for relative errors (in percent) 
that ensure a given approximate value a certain number of correct 
digits in the broad sense depending on its first two digits. 


TABLE 3
NUMBER OF CORRECT DIGITS OF AN APPROXIMATE NUMBER
DEPENDING ON THE LIMITING RELATIVE ERROR (IN %)

First two                       n
significant digits     2        3        4

10-11                  4.2      0.42     0.042
12-13                  3.6      0.36     0.036
14, ..., 16            2.9      0.29     0.029
17, ..., 19            2.5      0.25     0.025
20, ..., 22            2.2      0.22     0.022
23, ..., 25            1.9      0.19     0.019
26, ..., 29            1.7      0.17     0.017
30, ..., 34            1.4      0.14     0.014
35, ..., 39            1.2      0.12     0.012
40, ..., 44            1.1      0.11     0.011
45, ..., 49            1        0.1      0.01
50, ..., 54            0.9      0.09     0.009
55, ..., 59            0.8      0.08     0.008
60, ..., 69            0.7      0.07     0.007
70, ..., 79            0.6      0.06     0.006
80, ..., 99            0.5      0.05     0.005





An example will serve to illustrate how to use Table 3. Suppose
we have an approximate number a = 5.297 with relative error
$\delta = 0.5\%$. The first two significant digits here are 5 and 2; the
number made up of these digits lies between 50 and 54. Depending
on the number of correct digits, the latter are associated
with relative errors of 0.9%, 0.09% and 0.009%, etc. Since
$\delta = 0.5\% < 0.9\%$ and the relative error of a number does not
depend on what decimal places are expressed by the digits of
the number, the number a = 5.297 is correct to two decimals in
the broad sense.

Example 1. Taking $\pi = 3.142$, $\sqrt{7} = 2.65$, $e = 2.718$, $\log_{10} 5 = 0.699$,
$\sin 1° = 0.0174$, we find from Table 2 that the corresponding relative
errors are: $\delta = 0.033\%$, $\delta = 0.19\%$, $\delta = 0.019\%$, $\delta = 0.17\%$,
$\delta = 0.59\%$.

Example 2. From the deflection of a steel rod, Young's modulus
has been computed as E = 2212... T/cm$^2$ (T = metric tons) to
within 2%. How many digits are correct in the value found? From
Table 3 we find n = 2, and so $E = 22 \cdot 10^2$ T/cm$^2$.

Example 3. The gas constant R is computed for the explosive
mixture used in a gas motor. R = 31.5... with a relative error of
$\delta = 1\%$. Find the number of correct digits. From Table 3 we have
n = 2, and so R = 32.


1.7 THE ERROR OF A SUM


Theorem 1. The absolute error of an algebraic sum of several 
approximate numbers does not exceed the sum of the absolute errors 
of the numbers. 


Proof. Let $x_1, x_2, \dots, x_n$ be the given approximate numbers.
Consider their algebraic sum

    $u = \pm x_1 \pm x_2 \pm \dots \pm x_n$

Obviously

    $\Delta u = \pm \Delta x_1 \pm \Delta x_2 \pm \dots \pm \Delta x_n$

and, hence,

    $|\Delta u| \le |\Delta x_1| + |\Delta x_2| + \dots + |\Delta x_n|$    (1)

Corollary. For the limiting absolute error of an algebraic sum
we can take the sum of the limiting absolute errors of the terms:

    $\Delta_u = \Delta_{x_1} + \Delta_{x_2} + \dots + \Delta_{x_n}$    (2)


From (2) it follows that the limiting absolute error of a sum
cannot be less than the limiting absolute error of the least accurate
term (in the sense of the absolute error), which is to say the
term having the maximum absolute error. Consequently, no matter
how high the degree of accuracy of the other terms, we cannot,
through them, increase the accuracy of the sum. For this reason,
it is meaningless to retain extra digits in the more exact terms.
From the foregoing we obtain the following practical rule for the
addition of approximate numbers.

Rule. To add numbers of different absolute accuracy:

(1) find the numbers with the least number of decimal places and 
leave them unchanged; 

(2) round off the remaining numbers, retaining one or two more 
decimal places than those with the smallest number of decimals; 

(3) add the numbers, taking into account all retained decimals; 

(4) round off the result, reducing it by one decimal. 

When rounding to the mth place the terms of a sum

    $u = x_1 + x_2 + \dots + x_n$

by the rounding-off rule, the rounding error of the sum does not exceed

    $\Delta_{\text{round}} \le n \cdot \frac{1}{2} \cdot 10^m$    (3)

in the most unfavourable case.

A more exact calculation of the rounding error of a sum may 
be obtained if we take into account the signs of the rounding er- 
rors of the terms. 




Example. Find the sum of the approximate numbers 0.348, 0.1834, 
345.4, 235.2, 11.75, 9.27, 0.0849, 0.0214, 0.000354, each correct to 
the indicated significant digits (in the broad sense). 


Solution. We find the least accurate numbers 345.4 and 235.2 
whose absolute error may attain 0.1. Rounding the remaining num- 
bers to 0.01, we get 

    345.4
    235.2
     11.75
      9.27
      0.35
      0.18
      0.08
      0.02
      0.00
    ------
    602.25


Rounding the result to 0.1 by the even-digit rule, we get the 
approximate value of the sum: 602.2. 

The total error $\Delta$ of the result is made up of three terms:

(1) the sum of the limiting errors of the original data:

    $\Delta_1 = 10^{-3} + 10^{-4} + 10^{-1} + 10^{-1} + 10^{-2} + 10^{-2} + 10^{-4} + 10^{-4} + 10^{-6} = 0.221301 < 0.222$

(2) the absolute value of the sum of the rounding errors of the
terms (with regard for signs):

    $\Delta_2 = |-0.002 + 0.0034 + 0.0049 + 0.0014 + 0.000354| = 0.008054 < 0.009$

(3) the final rounding error of the result:

    $\Delta_3 = 0.050$

Hence

    $\Delta = \Delta_1 + \Delta_2 + \Delta_3 < 0.222 + 0.009 + 0.050 = 0.281 < 0.3$

and thus the desired sum is $602.2 \pm 0.3$.
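
The bookkeeping of this example can be reproduced with exact decimal arithmetic. The sketch below follows the three error terms of the solution; the variable names are hypothetical, and the terms are entered as strings so that their written decimal places are preserved.

```python
from decimal import Decimal, ROUND_HALF_EVEN

terms = ["0.348", "0.1834", "345.4", "235.2", "11.75", "9.27",
         "0.0849", "0.0214", "0.000354"]
step = Decimal("0.01")            # the more exact terms are rounded to 0.01

total = Decimal(0)
data_error = Decimal(0)           # sum of limiting errors of the data
rounding_sum = Decimal(0)         # signed sum of rounding errors of the terms
for t in terms:
    x = Decimal(t)
    exp = x.as_tuple().exponent   # last written decimal place, e.g. -3
    data_error += Decimal(1).scaleb(exp)   # one unit of that place (broad sense)
    r = x.quantize(step, rounding=ROUND_HALF_EVEN) if exp < -2 else x
    rounding_sum += x - r
    total += r

result = total.quantize(Decimal("0.1"), rounding=ROUND_HALF_EVEN)
final_rounding = abs(total - result)
bound = data_error + abs(rounding_sum) + final_rounding

print(total, result, data_error, abs(rounding_sum), final_rounding, bound)
```

The unrounded bound comes out slightly below the 0.281 of the text, because the text rounds each of the three terms upward before adding them.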


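Theorem 2. If the terms have one and the same sign, the limiting
relative error of their sum does not exceed the maximum limiting
relative error of any of the terms.

Proof. Let $u = x_1 + x_2 + \dots + x_n$ where, for definiteness, $x_i > 0$
($i = 1, 2, \dots, n$).

Denote by $A_i$ ($A_i > 0$; $i = 1, 2, \dots, n$) the exact magnitudes of
the terms $x_i$, and by $A = A_1 + A_2 + \dots + A_n$ the exact value of
the sum u. Then, for the limiting relative error of the sum we can
take

    $\delta_u = \frac{\Delta_u}{A} = \frac{\Delta_{x_1} + \Delta_{x_2} + \dots + \Delta_{x_n}}{A_1 + A_2 + \dots + A_n}$    (4)

Since

    $\delta_{x_i} = \frac{\Delta_{x_i}}{A_i}$    $(i = 1, 2, \dots, n)$

it follows that

    $\Delta_{x_i} = A_i \delta_{x_i}$    (4')

Substituting this expression into (4), we get

    $\delta_u = \frac{A_1 \delta_{x_1} + A_2 \delta_{x_2} + \dots + A_n \delta_{x_n}}{A_1 + A_2 + \dots + A_n}$

Let $\bar{\delta}$ be the greatest of the relative errors $\delta_{x_i}$, or $\delta_{x_i} \le \bar{\delta}$. Then

    $\delta_u \le \frac{\bar{\delta}(A_1 + A_2 + \dots + A_n)}{A_1 + A_2 + \dots + A_n} = \bar{\delta}$

Consequently, $\delta_u \le \bar{\delta}$, or

    $\delta_u \le \max(\delta_{x_1}, \delta_{x_2}, \dots, \delta_{x_n})$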


1.8 THE ERROR OF A DIFFERENCE 


We consider the difference of two approximate numbers: $u = x_1 - x_2$.
From formula (2) of Sec. 1.7, the limiting absolute error $\Delta_u$ of
the difference is

    $\Delta_u = \Delta_{x_1} + \Delta_{x_2}$

That is, the limiting absolute error of a difference is equal to the
sum of the limiting absolute errors of the diminuend and the subtrahend.
Whence the limiting relative error of the difference is

    $\delta_u = \frac{\Delta_{x_1} + \Delta_{x_2}}{A}$    (1)

where A is the exact value of the absolute magnitude of the difference
between the numbers $x_1$ and $x_2$.





Note on the loss of accuracy when subtracting nearly equal numbers. If the
approximate numbers $x_1$ and $x_2$ are nearly equal numbers and have
small absolute errors, the number A is small. From formula (1)
it follows that the limiting relative error in this case can be very
great whereas the relative errors of the diminuend and subtrahend
remain small. This amounts to a loss of accuracy.

To illustrate let us compute the difference between two numbers:
$x_1 = 47.132$ and $x_2 = 47.111$, each of which is correct to five significant
digits. Subtracting, we get $u = 47.132 - 47.111 = 0.021$.


The difference u has only two significant digits, of which the
last is uncertain since the limiting absolute error of the difference is

    $\Delta_u = 0.0005 + 0.0005 = 0.001$

The limiting relative errors of the subtrahend, diminuend, and
difference are:

    $\delta_{x_1} = \frac{0.0005}{47.132} \approx 0.00001$,
    $\delta_{x_2} = \frac{0.0005}{47.111} \approx 0.00001$,
    $\delta_u = \frac{0.001}{0.021} \approx 0.05$

The limiting relative error of the difference is roughly 5000 times
greater than the limiting relative errors of the original figures.

It is therefore desirable, in approximate computations, to trans- 
form the expressions in which computation of numerical values 
leads to the subtraction of nearly equal numbers. 


Example. Find the difference

    $u = \sqrt{2.01} - \sqrt{2}$    (2)

to three correct digits.

Solution. Since

    $\sqrt{2.01} = 1.41774469\ldots$

and

    $\sqrt{2} = 1.41421356\ldots$

the desired result is

    $u = 0.00353 = 3.53 \cdot 10^{-3}$

This result can be obtained by writing expression (2) as

    $u = \frac{0.01}{\sqrt{2.01} + \sqrt{2}}$

and then finding the roots $\sqrt{2.01}$ and $\sqrt{2}$ to three correct digits,
as witness,

    $u = \frac{0.01}{1.42 + 1.41} = \frac{0.01}{2.83} = 3.53 \cdot 10^{-3}$
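
The effect is easy to observe in a computation restricted to three significant digits. The Python sketch below uses a local `decimal` context with precision 3 to stand in for three-digit hand arithmetic; that modelling choice is an assumption made for illustration.

```python
from decimal import Decimal, localcontext

with localcontext() as ctx:
    ctx.prec = 3                             # every result keeps three digits
    r1 = Decimal("2.01").sqrt()              # 1.42
    r2 = Decimal(2).sqrt()                   # 1.41
    naive = r1 - r2                          # direct subtraction of close numbers
    stable = Decimal("0.01") / (r1 + r2)     # transformed expression

with localcontext() as ctx:
    ctx.prec = 12                            # reference value with ample digits
    reference = Decimal("2.01").sqrt() - Decimal(2).sqrt()

print(naive, stable, reference)
```

With the same three-digit roots, direct subtraction leaves a single doubtful digit, while the transformed expression returns all three correct digits of the reference value.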

From what has been said we can formulate a practical rule: in
approximate computations avoid as far as possible the subtraction
of two nearly equal approximate numbers; if it is necessary to subtract
such numbers, take the diminuend and subtrahend with a sufficient
number of additional correct digits (if that is possible). For
example, if we desire the difference of two numbers $x_1$ and $x_2$ to n
significant digits and it is known that the first m significant digits
will disappear by subtraction, then we must start with $m + n$ significant
digits in each of the numbers ($x_1$ and $x_2$).


1.9 THE ERROR OF A PRODUCT 


Theorem. The relative error of a product of several approximate 
nonzero numbers does not exceed the sum of the relative errors of the 
numbers. 


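Proof. Let $u = x_1 x_2 \dots x_n$.
Assuming for the sake of simplicity that the approximate numbers
$x_1, x_2, \dots, x_n$ are positive, we have

    $\ln u = \ln x_1 + \ln x_2 + \dots + \ln x_n$

Whence, using the approximate formula $\Delta \ln x \approx d \ln x = \frac{\Delta x}{x}$,
we get

    $\frac{\Delta u}{u} = \frac{\Delta x_1}{x_1} + \frac{\Delta x_2}{x_2} + \dots + \frac{\Delta x_n}{x_n}$

Taking the absolute value of the latter expression, we obtain

    $\left|\frac{\Delta u}{u}\right| \le \left|\frac{\Delta x_1}{x_1}\right| + \left|\frac{\Delta x_2}{x_2}\right| + \dots + \left|\frac{\Delta x_n}{x_n}\right|$

If $A_i$ ($i = 1, 2, \dots, n$) are the exact values of the factors $x_i$ and
$|\Delta x_i|$, as is usually the case, are small compared with $x_i$, we can
approximately set

    $\left|\frac{\Delta x_i}{x_i}\right| \approx \left|\frac{\Delta x_i}{A_i}\right| = \delta_i$

and

    $\left|\frac{\Delta u}{u}\right| \approx \delta$

where $\delta_i$ are the relative errors of the factors $x_i$ ($i = 1, 2, \dots, n$)
and $\delta$ is the relative error of the product.
Consequently

    $\delta \le \delta_1 + \delta_2 + \dots + \delta_n$    (1)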


It is obvious that (1) also holds true if the factors x; (i= 1, 2,..., n) 
have different signs. 


Corollary. The limiting relative error of a product is equal to the sum of the limiting relative errors of the factors:

$$\delta_u = \delta_{x_1} + \delta_{x_2} + \ldots + \delta_{x_n} \tag{2}$$


If all factors of the product $u$ are extremely exact, with the exception of one, then from (2) it follows that the limiting relative error of the product will practically coincide with the limiting relative error of the least accurate factor. In the particular case of only the factor $x_1$ being approximate, we simply have

$$\delta_u = \delta_{x_1}$$

Knowing the limiting relative error $\delta_u$ of a product $u$, we can determine its limiting absolute error $\Delta_u$ by the formula

$$\Delta_u = |u| \, \delta_u$$
Example 1. Determine the product $u$ of the approximate numbers $x_1 = 12.2$ and $x_2 = 73.56$ and the number of correct digits in it if the factors are correct to all written digits.

Solution. We have $\Delta_{x_1} = 0.05$ and $\Delta_{x_2} = 0.005$, whence

$$\delta_u = \frac{0.05}{12.2} + \frac{0.005}{73.56} \approx 0.0042$$

Since the product $u = 897.432$, it follows that $\Delta_u = u \delta_u = 897 \cdot 0.004 = 3.6$ (approximately).

And so $u$ is correct to only two digits and the result should be written as

$$u = 897 \pm 4$$
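The corollary can be sketched in a few lines of code. The function name `product_error` is an assumption made for this illustration; it sums the limiting relative errors of the factors and converts the result to a limiting absolute error.

```python
def product_error(values, abs_errors):
    """Return (product, limiting relative error, limiting absolute error)."""
    product = 1.0
    rel = 0.0
    for x, dx in zip(values, abs_errors):
        product *= x
        rel += dx / abs(x)          # delta_u = sum of delta_{x_i}
    return product, rel, abs(product) * rel

# Data of Example 1: factors correct to all written digits.
u, delta_u, Delta_u = product_error([12.2, 73.56], [0.05, 0.005])
print(u, delta_u, Delta_u)
```

Without intermediate rounding the limiting absolute error comes out slightly above the hand-computed 3.6, which is consistent with writing the result as $897 \pm 4$.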
Note the special case

$$u = kx$$

where $k$ is an exact factor different from zero. We have

$$\delta_u = \delta_x$$

and

$$\Delta_u = |k| \, \Delta_x$$

When multiplying an approximate number by an exact factor $k$, the limiting relative error remains unchanged, while the limiting absolute error is increased $|k|$-fold.


Example 2. In aiming a rocket at a target, the limiting angular error is $\varepsilon = 1'$. What is the possible deviation $\Delta_u$ of the rocket from the target over a range of $x = 2{,}000$ km in the absence of error correction?

Solution. Here

$$\Delta_u = \frac{1}{57.3 \cdot 60} \cdot 2{,}000 \text{ km} \approx 580 \text{ m}$$

It is clear that the relative error of a product cannot be less than the relative error of the least accurate factor. For this reason, as in addition, it is meaningless to retain extra significant digits in the more precise factors.

The following rule is a useful guide: in order to find the product of several approximate numbers correct to different numbers of significant digits, it is sufficient to:

(1) round them off so that each contains one or two significant digits more than the number of correct digits in the least accurate factor;

(2) in the final result retain as many significant digits as there are correct digits in the least accurate factor (or keep one extra digit).

Example 3. Find the product of the approximate numbers $x_1 = 2.5$ and $x_2 = 72.397$ correct to the number of digits written.

Solution. Using the rule, we have, after rounding, $x_1 = 2.5$ and $x_2 = 72.4$, whence $x_1 x_2 = 2.5 \cdot 72.4 = 181 \approx 1.8 \cdot 10^2$.


1.10 THE NUMBER OF CORRECT DIGITS IN A PRODUCT


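Suppose we have a product of $n$ factors ($n \le 10$), $u = x_1 x_2 \ldots x_n$, each correct to at least $m$ ($m > 1$) digits. Assume also that $\alpha_1, \alpha_2, \ldots, \alpha_n$ are the first significant digits in a decimal representation of the factors:

$$x_i = \alpha_i 10^{p_i} + \beta_i 10^{p_i - 1} + \ldots \quad (i = 1, 2, \ldots, n)$$

Then, by the formula of Sec. 1.5 for the limiting relative error, we have

$$\delta_{x_i} = \frac{1}{2\alpha_i} \left(\frac{1}{10}\right)^{m-1} \quad (i = 1, 2, \ldots, n)$$

and, hence,

$$\delta_u = \frac{1}{2} \left(\frac{1}{\alpha_1} + \frac{1}{\alpha_2} + \ldots + \frac{1}{\alpha_n}\right) \left(\frac{1}{10}\right)^{m-1} \tag{1}$$

Since $\frac{1}{\alpha_1} + \frac{1}{\alpha_2} + \ldots + \frac{1}{\alpha_n} \le 10$, it follows that

$$\delta_u \le \frac{1}{2} \left(\frac{1}{10}\right)^{m-2}$$

Consequently the product $u$ is correct to $m - 2$ digits in the most unfavourable case.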


Rule. If all factors are correct to $m$ digits and their number does not exceed 10, then the number of correct (in the narrow sense) digits of the product is less than $m$ by one or two units.

Consequently if $m$ correct digits are required in a product, the factors should be taken with one or two extra digits.

If the factors are of different accuracies, then m is to mean the 
number of correct digits in the least accurate factor. Thus, the 
number of correct digits in a product of a small number of factors 
(of the order of ten) may be one or two units less than the number 
of correct digits in the least accurate factor. 




Example 1. Determine the relative error and the number of correct digits in the product $u = 93.87 \cdot 9.236$.

Solution. By formula (1) we have

$$\delta_u = \frac{1}{2} \left(\frac{1}{9} + \frac{1}{9}\right) \cdot 10^{-3} < \frac{1}{2} \cdot 10^{-3}$$

Hence the product $u$ is correct to at least three digits (see Sec. 1.5).

Example 2. Determine the relative error and the number of correct digits in the product $u = 17.63 \cdot 14.285$.

Solution.

$$\delta_u = \frac{1}{2} \left(\frac{1}{1} + \frac{1}{1}\right) \cdot 10^{-3} = 1 \cdot 10^{-3}$$

Thus the product will contain at least three correct digits (in the broad sense).
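Formula (1) of this section is easy to evaluate mechanically. The sketch below is illustrative only: the function names are hypothetical, and the factors are passed as strings so that the first significant digit is read exactly as written.

```python
from decimal import Decimal

def first_significant_digit(text: str) -> int:
    """First nonzero digit of a decimal number given as a string."""
    return next(d for d in Decimal(text).as_tuple().digits if d != 0)

def product_rel_error_bound(factors, m: int) -> float:
    """Limiting relative error (1/2) * sum(1/alpha_i) * 10**-(m-1),
    where every factor is correct to at least m digits."""
    s = sum(1.0 / first_significant_digit(f) for f in factors)
    return 0.5 * s * 10.0 ** (-(m - 1))

bound_1 = product_rel_error_bound(["93.87", "9.236"], m=4)
bound_2 = product_rel_error_bound(["17.63", "14.285"], m=4)
print(bound_1, bound_2)
```

In the second call $m = 4$ is the number of correct digits of the least accurate factor, 17.63.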


1.11 THE ERROR OF A QUOTIENT 


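If $u = \frac{x}{y}$, then $\ln u = \ln x - \ln y$ and

$$\frac{\Delta u}{u} = \frac{\Delta x}{x} - \frac{\Delta y}{y}$$

whence

$$\left|\frac{\Delta u}{u}\right| \le \left|\frac{\Delta x}{x}\right| + \left|\frac{\Delta y}{y}\right|$$

This formula shows that the theorem of Sec. 1.9 holds true for a quotient as well.

Theorem. The relative error of a quotient does not exceed the sum of the relative errors of the dividend and divisor.

Corollary. If $u = \frac{x}{y}$, then $\delta_u = \delta_x + \delta_y$.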


Example. Find the number of correct digits in the quotient $u = 25.7 : 3.6$ assuming the dividend and divisor are correct to the last digit given.

Solution. We have

$$\delta_u = \frac{0.05}{25.7} + \frac{0.05}{3.6} = 0.002 + 0.014 = 0.016$$

Since $u = 7.14$, then $\Delta_u = 0.016 \cdot 7.14 = 0.11$. And so the quotient $u$ is correct to two digits in the broad sense, that is, $u = 7.1$ or, more precisely,

$$u = 7.14 \pm 0.11$$




1.12 THE NUMBER OF CORRECT DIGITS IN A QUOTIENT


Suppose the dividend $x$ and the divisor $y$ are correct to at least $m$ digits. If $\alpha$ and $\beta$ are their first significant digits, then for the limiting relative error of the quotient $u$ we can take the quantity

$$\delta_u = \frac{1}{2} \left(\frac{1}{\alpha} + \frac{1}{\beta}\right) \left(\frac{1}{10}\right)^{m-1}$$

From this we get the rule: (1) if $\alpha \ge 2$ and $\beta \ge 2$, then the quotient $u$ is correct to at least $m - 1$ digits; (2) if $\alpha = 1$ or $\beta = 1$, then the quotient $u$ is definitely correct to $m - 2$ digits.


1.13 THE RELATIVE ERROR OF A POWER 


Suppose $u = x^m$ ($m$ any natural number), then $\ln u = m \ln x$ and, hence,

$$\frac{\Delta u}{u} = m \frac{\Delta x}{x}$$

From this we have

$$\delta_u = m \delta_x \tag{1}$$

or the limiting relative error of the $m$th power of a number is $m$ times the limiting relative error of the number.


1.14 THE RELATIVE ERROR OF A ROOT 


Now suppose $u = \sqrt[m]{x}$; then $u^m = x$, whence

$$\delta_u = \frac{1}{m} \delta_x \tag{1}$$

That is, the limiting relative error of a root of index $m$ is $m$ times less than the limiting relative error of the radicand.


Example. Determine with what relative error and with how many correct digits we can find the side $a$ of a square if its area $s = 12.34$ to the nearest hundredth.

Solution. We have $a = \sqrt{s} = 3.5128\ldots$ . Since

$$\delta_s = \frac{0.01}{12.34} \approx 0.0008$$

it follows that $\delta_a = \frac{1}{2} \delta_s = 0.0004$. Therefore

$$\Delta_a = 3.5128 \cdot 0.0004 = 1.4 \cdot 10^{-3}$$

And so the number $a$ will have about four correct digits (in the broad sense) and, hence, $a = 3.513$.
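The rule for roots can be checked on the square example. This is a minimal sketch; `root_error` is a name chosen here for illustration.

```python
def root_error(x: float, dx: float, m: int = 2):
    """Return (root, limiting relative error, limiting absolute error)
    for the m-th root of x, where dx is the limiting absolute error of x."""
    root = x ** (1.0 / m)
    rel = (dx / abs(x)) / m      # delta_u = delta_x / m
    return root, rel, root * rel

# Side of a square with area s = 12.34 known to within 0.01.
a, delta_a, Delta_a = root_error(12.34, 0.01)
print(a, delta_a, Delta_a)
```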




1.15 COMPUTATIONS IN WHICH ERRORS ARE NOT TAKEN
INTO EXACT ACCOUNT 


In the preceding sections we indicated ways of estimating the limiting absolute error of operations. It was assumed that the absolute errors of the components reinforce one another, but actually this is rarely the case.

In large-scale computations, when the error of each separate result is not taken into account, it is advisable to apply the following rules for counting digits [6].

1. When adding and subtracting approximate numbers, the last retained digit in the result must be of the largest of the decimal orders expressed by the last correct significant digits of the original data.

2. When multiplying and dividing approximate numbers, retain in the result as many significant digits as there are in the given approximate number having the smallest number of correct significant digits.

3. When squaring or cubing an approximate number, retain as many significant digits in the result as there are correct significant digits in the base of the power.

4. When taking square or cube roots of an approximate number, take as many significant digits in the answer as there are correct digits in the radicand.

5. In all intermediate results, keep one more digit than recommended by the rules given earlier. Discard this additional digit in the final result.

6. When using logarithms, count the number of correct significant digits in the approximate number having the smallest number of correct significant digits and use tables of logarithms to one place more than that number. The last significant digit is discarded in the answer.

7. If the data can be taken with arbitrary accuracy, then in order to obtain a result correct to $k$ digits, take the original data with the number of digits such that according to the earlier rules we obtain $k + 1$ correct digits in the answer.

If some of the data have extra lower-order digits (in addition and subtraction) or more significant digits than the others (in multiplication, division, powers or roots), then they must first be rounded off so that one additional digit is retained.


1.16 GENERAL FORMULA FOR ERRORS


The prime objective of the theory of errors is: given the errors 
of a certain set of quantities, to determine the error of a given 
function of these quantities. 




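Suppose a differentiable function

$$u = f(x_1, x_2, \ldots, x_n)$$

is given; let $|\Delta x_i|$ ($i = 1, 2, \ldots, n$) be the absolute errors of the arguments of the function. Then the absolute error of the function is

$$|\Delta u| = |f(x_1 + \Delta x_1, x_2 + \Delta x_2, \ldots, x_n + \Delta x_n) - f(x_1, x_2, \ldots, x_n)|$$

In practical situations, $|\Delta x_i|$ are ordinarily small quantities whose products, squares and higher powers may be neglected. We can therefore put

$$|\Delta u| \approx |df(x_1, x_2, \ldots, x_n)| = \left|\sum_{i=1}^{n} \frac{\partial f}{\partial x_i} \Delta x_i\right| \le \sum_{i=1}^{n} \left|\frac{\partial f}{\partial x_i}\right| |\Delta x_i|$$

Thus

$$|\Delta u| \le \sum_{i=1}^{n} \left|\frac{\partial f}{\partial x_i}\right| |\Delta x_i| \tag{1}$$

From this, denoting by $\Delta_{x_i}$ ($i = 1, 2, \ldots, n$) the limiting absolute errors of the arguments $x_i$ and by $\Delta_u$ the limiting error of the function $u$, we get, for small $\Delta x_i$,

$$\Delta_u = \sum_{i=1}^{n} \left|\frac{\partial u}{\partial x_i}\right| \Delta_{x_i} \tag{2}$$

Dividing both sides of inequality (1) by $u$, we get an estimate for the relative error of the function $u$:

$$\delta \le \sum_{i=1}^{n} \left|\frac{\partial f / \partial x_i}{f}\right| |\Delta x_i| = \sum_{i=1}^{n} \left|\frac{\partial}{\partial x_i} \ln f(x_1, \ldots, x_n)\right| |\Delta x_i| \tag{3}$$

Hence, for the limiting relative error of the function $u$ we can take

$$\delta_u = \sum_{i=1}^{n} \left|\frac{\partial}{\partial x_i} \ln u\right| \Delta_{x_i} \tag{4}$$

Example 1. Find the limiting absolute and relative errors of the volume of a sphere $V = \frac{1}{6} \pi d^3$ if the diameter $d = 3.7$ cm $\pm$ 0.05 cm and $\pi \approx 3.14$.

Solution. Regarding $\pi$ and $d$ as variable quantities, we compute the partial derivatives

$$\frac{\partial V}{\partial \pi} = \frac{1}{6} d^3 = 8.44,$$
$$\frac{\partial V}{\partial d} = \frac{1}{2} \pi d^2 = 21.5$$

By virtue of formula (2), the limiting absolute error of the volume is

$$\Delta_V = \left|\frac{\partial V}{\partial \pi}\right| |\Delta \pi| + \left|\frac{\partial V}{\partial d}\right| |\Delta d| = 8.44 \cdot 0.0016 + 21.5 \cdot 0.05 = 0.013 + 1.075 = 1.088 \text{ cm}^3 \approx 1.1 \text{ cm}^3$$

and so

$$V = \frac{1}{6} \pi d^3 \approx 26.5 \text{ cm}^3 \pm 1.1 \text{ cm}^3 \tag{5}$$

Whence the limiting relative error of the volume is

$$\delta_V = \frac{1.088 \text{ cm}^3}{26.5 \text{ cm}^3} = 0.041 \approx 4\%$$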
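Formula (2) of this section lends itself to a direct numerical sketch. The version below is an illustration under stated assumptions: the partial derivatives are approximated by central differences rather than computed analytically, and the names `limiting_abs_error` and `sphere_volume` are introduced here, not taken from the text.

```python
def limiting_abs_error(f, x, dx, h=1e-6):
    """Estimate sum(|df/dx_i| * Delta_{x_i}) using central differences."""
    total = 0.0
    for i, (xi, dxi) in enumerate(zip(x, dx)):
        step = h * max(1.0, abs(xi))
        up = list(x)
        down = list(x)
        up[i] = xi + step
        down[i] = xi - step
        partial = (f(*up) - f(*down)) / (2.0 * step)
        total += abs(partial) * dxi
    return total

def sphere_volume(pi_approx, d):
    return pi_approx * d ** 3 / 6.0

# Data of Example 1: pi = 3.14 (error 0.0016), d = 3.7 cm (error 0.05 cm).
V = sphere_volume(3.14, 3.7)
Delta_V = limiting_abs_error(sphere_volume, [3.14, 3.7], [0.0016, 0.05])
delta_V = Delta_V / V
print(V, Delta_V, delta_V)
```

Treating the approximate value of $\pi$ as an argument with its own error, as the worked solution does, shows that almost the entire error of the volume comes from the diameter.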


Example 2. Young's modulus is determined from the deflection of a rod of rectangular cross-section by the formula

$$E = \frac{1}{4} \cdot \frac{l^3 p}{a^3 b s}$$

where $l$ is the length of the rod, $a$ and $b$ are the dimensions of the cross-section, $s$ is the bending deflection, and $p$ is the load. Compute the limiting relative error in a determination of Young's modulus $E$ if $p = 20$ kgf, $\delta_p = 0.1\%$, $a = 3$ mm, $\delta_a = 1\%$, $b = 44$ mm, $\delta_b = 1\%$, $l = 50$ cm, $\delta_l = 1\%$, $s = 2.5$ cm, $\delta_s = 1\%$.

Solution. $\ln E = 3 \ln l + \ln p - 3 \ln a - \ln b - \ln s - \ln 4$.

Whence, replacing the increments by differentials, we get

$$\frac{\Delta E}{E} = 3 \frac{\Delta l}{l} + \frac{\Delta p}{p} - 3 \frac{\Delta a}{a} - \frac{\Delta b}{b} - \frac{\Delta s}{s}$$

Hence

$$\delta_E = 3\delta_l + \delta_p + 3\delta_a + \delta_b + \delta_s = 3 \cdot 0.01 + 0.001 + 3 \cdot 0.01 + 0.01 + 0.01 = 0.081$$

Thus, the limiting relative error constitutes 0.081, which is roughly 8% of the quantity being measured.

Performing the numerical computations, we obtain

$$E = (2.10 \pm 0.17) \cdot 10^6 \ \frac{\text{kgf}}{\text{cm}^2}$$
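For a function that is a product of powers, the logarithmic form reduces the computation to a weighted sum of relative errors. The sketch below assumes all lengths are converted to centimetres; the dictionary layout and the name `power_law_rel_error` are illustrative choices.

```python
def power_law_rel_error(exponents, rel_errors):
    """Limiting relative error of prod(x_k ** e_k): sum(|e_k| * delta_k)."""
    return sum(abs(exponents[k]) * rel_errors[k] for k in exponents)

# E = l**3 * p / (4 * a**3 * b * s)
exponents = {"l": 3, "p": 1, "a": -3, "b": -1, "s": -1}
rel_errors = {"l": 0.01, "p": 0.001, "a": 0.01, "b": 0.01, "s": 0.01}
delta_E = power_law_rel_error(exponents, rel_errors)

# Lengths in cm, load in kgf.
l, p, a, b, s = 50.0, 20.0, 0.3, 4.4, 2.5
E = l ** 3 * p / (4.0 * a ** 3 * b * s)
Delta_E = E * delta_E
print(delta_E, E, Delta_E)
```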


1.17 THE INVERSE PROBLEM OF THE THEORY OF ERRORS 


Also of practical importance is the inverse problem: what must 
the absolute errors of the arguments of a function be so that the 
absolute error of the function does not exceed a given magnitude? 

This problem is mathematically indeterminate since the given limiting error $\Delta_u$ of a function $u = f(x_1, x_2, \ldots, x_n)$ may be ensured by establishing the limiting absolute errors $\Delta_{x_i}$ of its arguments in different ways.




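The simplest solution of the inverse problem is given by the so-called principle of equal effects. According to this principle, it is assumed that all the partial differentials

$$\left|\frac{\partial u}{\partial x_i}\right| \Delta_{x_i} \quad (i = 1, 2, \ldots, n)$$

contribute an equal amount to the total absolute error $\Delta_u$ of the function $u = f(x_1, x_2, \ldots, x_n)$.

Suppose the magnitude of the limiting absolute error $\Delta_u$ is given. Then, on the basis of formula (2) of Sec. 1.16,

$$\Delta_u = \sum_{i=1}^{n} \left|\frac{\partial u}{\partial x_i}\right| \Delta_{x_i} \tag{1}$$

Assuming that all the terms are equal, we have

$$\left|\frac{\partial u}{\partial x_1}\right| \Delta_{x_1} = \left|\frac{\partial u}{\partial x_2}\right| \Delta_{x_2} = \ldots = \left|\frac{\partial u}{\partial x_n}\right| \Delta_{x_n} = \frac{\Delta_u}{n}$$

whence

$$\Delta_{x_i} = \frac{\Delta_u}{n \left|\dfrac{\partial u}{\partial x_i}\right|} \quad (i = 1, 2, \ldots, n) \tag{2}$$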
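Example 1. The base of a cylinder has radius $R \approx 2$ m, the altitude of the cylinder is $H \approx 3$ m. With what absolute errors must we determine $R$ and $H$ so that the volume $V$ may be computed to within 0.1 m³?

Solution. We have $V = \pi R^2 H$ and $\Delta_V = 0.1$ m³.

Putting $R = 2$ m, $H = 3$ m, $\pi = 3.14$, we get, approximately,

$$\frac{\partial V}{\partial \pi} = R^2 H = 12,$$
$$\frac{\partial V}{\partial R} = 2\pi R H = 37.7,$$
$$\frac{\partial V}{\partial H} = \pi R^2 = 12.6$$

From this, since $n = 3$, we have, on the basis of formula (2),

$$\Delta_\pi = \frac{0.1}{3 \cdot 12} < 0.003,$$
$$\Delta_R = \frac{0.1}{3 \cdot 37.7} < 0.001,$$
$$\Delta_H = \frac{0.1}{3 \cdot 12.6} < 0.003$$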
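The principle of equal effects amounts to one division per argument. A minimal sketch follows; the name `equal_effects` is hypothetical and the partial derivatives are supplied by the caller.

```python
def equal_effects(partials, target_error):
    """Permissible limiting absolute error of each argument when every
    term |du/dx_i| * Delta_{x_i} contributes target_error / n."""
    n = len(partials)
    return [target_error / (n * abs(p)) for p in partials]

# Cylinder V = pi * R**2 * H with R = 2 m, H = 3 m, pi = 3.14.
R, H, pi_approx = 2.0, 3.0, 3.14
partials = [R * R * H, 2.0 * pi_approx * R * H, pi_approx * R * R]
d_pi, d_R, d_H = equal_effects(partials, 0.1)
print(d_pi, d_R, d_H)
```

Multiplying each permissible error by its partial derivative and summing returns the prescribed 0.1 m³, which is a quick consistency check on the allocation.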


Example 2. It is required to find the value of the function

$$u = 6x^2 (\log_{10} x - \sin 2y)$$

to two decimal places; the approximate values of $x$ and $y$ are 15.2 and 57°, respectively. Find the permissible absolute errors in these quantities.

Solution. Here

$$u = 6x^2 (\log_{10} x - \sin 2y) = 6 (15.2)^2 (\log_{10} 15.2 - \sin 114°) = 371.9,$$

$$\frac{\partial u}{\partial x} = 12x (\log_{10} x - \sin 2y) + 6xM = 88.54$$

where $M = 0.43429$ is the modulus of common logarithms;

$$\frac{\partial u}{\partial y} = -12x^2 \cos 2y = +1127.7$$

For the result to be correct to two decimals, the equation $\Delta_u = 0.005$ must hold. Then, by the principle of equal effects, we have

$$\Delta_x = \frac{\Delta_u}{2 \left|\dfrac{\partial u}{\partial x}\right|} = \frac{0.005}{2 \cdot 88.54} = 0.000028,$$

$$\Delta_y = \frac{\Delta_u}{2 \left|\dfrac{\partial u}{\partial y}\right|} = \frac{0.005}{2 \cdot 1127.7} = 0.0000022 \text{ radian} = 0''.45$$
oy 








It often happens that when solving the inverse problem by the principle of equal effects we find that the limiting absolute errors [found from formula (2)] of the separate independent variables turn out to be so small that it is practically impossible to attain the necessary accuracy in measuring these quantities. In such cases it is best to depart from the principle of equal effects and, by reasonably diminishing the errors of one part of the variables, increase the errors of the other part.


Example 3. How accurately do we have to measure the radius of a circle $R = 30.5$ cm and to how many decimals do we have to take $\pi$ so that the area of the circle is found to within 0.1%?

Solution. We have $s = \pi R^2$ and $\ln s = \ln \pi + 2 \ln R$, whence

$$\frac{\Delta s}{s} = \frac{\Delta \pi}{\pi} + \frac{2 \Delta R}{R} = 0.001$$

By the principle of equal effects, put

$$\frac{\Delta \pi}{\pi} = 0.0005, \quad \frac{2 \Delta R}{R} = 0.0005$$

whence $\Delta_\pi \le 0.0016$ and $\Delta_R \le 0.00025 R = 0.0076$ cm.

Thus, we have to take $\pi = 3.14$ and measure $R$ to within thousandths of a centimetre. It is plain that such accuracy of measurement is extremely difficult, and so it is better to do as follows: take $\pi = 3.142$, whence $\frac{\Delta \pi}{\pi} = 0.00013$; then $\frac{2 \Delta R}{R} = 0.001 - 0.00013 = 0.00087$ and $\Delta_R \le 0.013$ cm. Accuracy of this degree can be achieved with relative ease.

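Sometimes the limiting absolute error of all arguments $x_i$ ($i = 1, 2, \ldots, n$) is allowed to be the same. Then, putting

$$\Delta_{x_1} = \Delta_{x_2} = \ldots = \Delta_{x_n}$$

we get, from formula (1),

$$\Delta_{x_i} = \frac{\Delta_u}{\sum\limits_{j=1}^{n} \left|\dfrac{\partial u}{\partial x_j}\right|} \quad (i = 1, 2, \ldots, n)$$

Finally, it may be assumed that the accuracy of measurements of all arguments $x_i$ ($i = 1, 2, \ldots, n$) is the same, that is, the limiting relative errors $\delta_{x_i}$ ($i = 1, 2, \ldots, n$) of the arguments are equal:

$$\delta_{x_1} = \delta_{x_2} = \ldots = \delta_{x_n}$$

From this we obtain

$$\frac{\Delta_{x_1}}{|x_1|} = \frac{\Delta_{x_2}}{|x_2|} = \ldots = \frac{\Delta_{x_n}}{|x_n|} = k$$

where $k$ is the common value of the ratios.

Consequently

$$\Delta_{x_i} = k |x_i| \quad (i = 1, 2, \ldots, n)$$

Putting these values into formula (1), we get

$$\Delta_u = k \sum_{i=1}^{n} \left|x_i \frac{\partial u}{\partial x_i}\right|$$

and

$$k = \frac{\Delta_u}{\sum\limits_{i=1}^{n} \left|x_i \dfrac{\partial u}{\partial x_i}\right|}$$

And we finally have

$$\Delta_{x_i} = \frac{|x_i| \, \Delta_u}{\sum\limits_{j=1}^{n} \left|x_j \dfrac{\partial u}{\partial x_j}\right|} \quad (i = 1, 2, \ldots, n)$$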

Other versions can also be utilized. 




The solution is similar for the second inverse problem of the theory of errors when the limiting relative error of a function is given and the limiting absolute or relative errors of the arguments are sought.

Occasionally, there are conditions in the very statement of the 
problem that prevent one from applying the principle of equal 
effects. 

Example 4. The sides of a rectangle are $a \approx 5$ m and $b \approx 200$ m. What is the permissible limiting absolute error in measuring these sides (the same for both sides) so that the area $S$ of the rectangle can be determined with a limiting absolute error of $\Delta_S = 1$ m²?

Solution. Since

$$S = ab$$

it follows that

$$\Delta S \approx b \Delta a + a \Delta b$$

and

$$\Delta_S = b \Delta_a + a \Delta_b$$

By hypothesis,

$$\Delta_a = \Delta_b$$

and so

$$\Delta_a = \frac{\Delta_S}{a + b} = \frac{1}{205} \approx 0.005 \text{ m} = 5 \text{ mm}$$


1.18 ACCURACY IN THE DETERMINATION OF ARGUMENTS 
FROM A TABULATED FUNCTION 


It is often necessary in computational work to determine an ar- 
gument from the value of a tabulated function. For example, we 
are constantly called upon to determine a number from a table of 
logarithms or an angle from the tabular value of a trigonometric 
function, etc. Naturally, any error in the function causes an error 
in the determination of the argument. 

Suppose we have a table of single entry for the function $y = f(x)$. If the function $f(x)$ is differentiable, then for sufficiently small values $|\Delta x|$ we have

$$|\Delta y| = |f'(x)| \, |\Delta x|$$

whence

$$|\Delta x| = \frac{|\Delta y|}{|f'(x)|} \tag{1}$$

or

$$\Delta_x = \frac{1}{|y'|} \Delta_y$$



Let us apply formula (1) to some of the more common tabula- 
ted functions. 


A. LOGARITHMS

Suppose $y = \ln x$, then $y' = \frac{1}{x}$. From this,

$$\Delta_x = x \Delta_y \tag{2}$$

But if $y = \log_{10} x$, then $y' = \frac{M}{x}$, where $M = 0.43429$;

$$\Delta_x = \frac{1}{M} x \Delta_y = 2.30 \, x \Delta_y \tag{2'}$$

Whence, in particular, we obtain $\delta_x = 2.30 \Delta_y$; that is, the limiting relative error of the number in the table of common logarithms is roughly 2.3 times the limiting absolute error of the logarithm of this number.


B. TRIGONOMETRIC FUNCTIONS 


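1. If $y = \sin x$ $\left(0 < x < \frac{\pi}{2}\right)$, then $y' = \cos x$ and hence

$$\Delta_x = \Delta_y \sec x \text{ radians} \tag{3}$$

2. For the function

$$y = \tan x \quad \left(0 < x < \frac{\pi}{2}\right)$$

we have

$$y' = \sec^2 x$$

and

$$\Delta_x = \Delta_y \cos^2 x \text{ radians} \tag{4}$$

3. If $y = \log_{10} (\sin x)$ $\left(0 < x < \frac{\pi}{2}\right)$, then

$$y' = M \cot x \quad \text{and} \quad \Delta_x = 2.30 \tan x \, \Delta_y \text{ radians} \tag{5}$$

4. Putting $y = \log_{10} (\tan x)$ $\left(0 < x < \frac{\pi}{2}\right)$, we have

$$y' = \frac{2M}{\sin 2x} \quad \text{and} \quad \Delta_x = 1.15 \sin 2x \, \Delta_y \text{ radians} \tag{6}$$

Since it is obvious that $\frac{\sin 2x}{2} < \tan x$ for $0 < x < \frac{\pi}{2}$, it follows from formulas (5) and (6) that the angle $x$ is more accurately determined from a table of log tangents than from a table of log sines.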




C. EXPONENTIAL FUNCTIONS

If $y = e^x$, then $y' = e^x$ and

$$\Delta_x = \frac{\Delta_y}{e^x} \tag{7}$$

or

$$\Delta_x = \frac{\Delta_y}{y} = \delta_y$$


Example 1. To what accuracy can we determine the number $x \approx 5000$, using a four-place table of common logarithms?

Solution. From (2') we get

$$\Delta_x = 2.30 \cdot 5000 \cdot \frac{1}{2} \cdot 10^{-4} \approx 0.6$$

which means the number $x$ is correct to roughly four digits.


Example 2. Find the error in a determination of the angle $x = 60°$ from:

(a) a five-place table of log sines,

(b) a five-place table of log tangents.

Solution. For the first case, we have, from formula (5),

$$\Delta_x = 2.30 \cdot \sqrt{3} \cdot \frac{1}{2} \cdot 10^{-5} \text{ radian} = 0.00002 \text{ radian} = 4''$$

In the second case, using (6), we get

$$\Delta_x = 1.15 \cdot \frac{\sqrt{3}}{2} \cdot \frac{1}{2} \cdot 10^{-5} \text{ radian} \approx 0.000005 \text{ radian} \approx 1''$$

which is one fourth the preceding error.
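Formulas (5) and (6) can be compared numerically. The following sketch assumes a table rounded to five places, so that the limiting absolute error of a tabulated value is $\frac{1}{2} \cdot 10^{-5}$; the function names are illustrative.

```python
import math

M = math.log10(math.e)              # modulus of common logarithms, 0.43429...
ARCSEC_PER_RADIAN = 180.0 * 3600.0 / math.pi

def angle_error_from_log_sine(x, dy):
    """Formula (5): Delta_x = (1/M) * tan(x) * Delta_y, in radians."""
    return math.tan(x) * dy / M

def angle_error_from_log_tangent(x, dy):
    """Formula (6): Delta_x = (1/(2M)) * sin(2x) * Delta_y, in radians."""
    return math.sin(2.0 * x) * dy / (2.0 * M)

x = math.radians(60.0)
dy = 0.5e-5                          # five-place table
err_sin = angle_error_from_log_sine(x, dy)
err_tan = angle_error_from_log_tangent(x, dy)
print(err_sin * ARCSEC_PER_RADIAN, err_tan * ARCSEC_PER_RADIAN)
```

The ratio of the two errors equals $\sec^2 x$, which is exactly 4 at $x = 60°$.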


1.19 THE METHOD OF BOUNDS 


The commonly used error estimation of a function [Sec. 1.16, formula (2)] is approximate, since this estimate is based on neglecting the products of errors. In certain cases we want to know the exact bounds of the required value of the function if the range of its arguments is known. The simplest way to attain this is via the method of double computation, also called the method of bounds.

Let

$$u = f(x_1, x_2, \ldots, x_n)$$

be a continuously differentiable function monotonic with respect to each argument $x_i$ ($i = 1, 2, \ldots, n$). For this purpose it suffices to assume that the derivatives $\frac{\partial f}{\partial x_i}$ ($i = 1, 2, \ldots, n$) retain the same sign throughout the range $\omega$ of the arguments. Suppose that

$$\underline{x}_i \le x_i \le \overline{x}_i \quad (i = 1, 2, \ldots, n) \tag{1}$$

and the parallelepiped (1) lies entirely in the domain $\omega$.

Assume that $\underline{\xi}_i = \underline{x}_i$, $\overline{\xi}_i = \overline{x}_i$ if the function $f$ is an increasing function with respect to the variable $x_i$, and $\underline{\xi}_i = \overline{x}_i$, $\overline{\xi}_i = \underline{x}_i$ if the function $f$ is a decreasing function with respect to the variable $x_i$.

Then obviously

$$\underline{u} \le u \le \overline{u} \tag{2}$$

where

$$\underline{u} = f(\underline{\xi}_1, \underline{\xi}_2, \ldots, \underline{\xi}_n)$$

and

$$\overline{u} = f(\overline{\xi}_1, \overline{\xi}_2, \ldots, \overline{\xi}_n)$$

Note that the variables $\underline{\xi}_i$ ($i = 1, 2, \ldots, n$) and the result of the operations of $f$ on them can only be rounded in the sense of rounding down the quantity $\underline{u}$, and the variables $\overline{\xi}_i$ ($i = 1, 2, \ldots, n$) and the result of the operations of $f$ on them, only in the sense of rounding up the quantity $\overline{u}$. Strict fulfillment of inequality (2) will then be guaranteed. In the particular case when the function $f$ is monotonically increasing with respect to each argument $x_i$ ($i = 1, 2, \ldots, n$), we simply have

$$f(\underline{x}_1, \underline{x}_2, \ldots, \underline{x}_n) \le u \le f(\overline{x}_1, \overline{x}_2, \ldots, \overline{x}_n) \tag{3}$$


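Example. An aluminium cylinder with base diameter $d = 2$ cm $\pm$ 0.01 cm and altitude $h = 11$ cm $\pm$ 0.02 cm weighs $p = 93.4$ gf $\pm$ 0.001 gf. Determine the specific weight $\gamma$ of aluminium and estimate its limiting absolute error.

Solution. The volume of the cylinder is

$$V = \frac{\pi d^2 h}{4}$$

whence

$$\gamma = \frac{4p}{\pi d^2 h} \tag{4}$$

From formula (4) it follows that in the range $p > 0$, $d > 0$, $h > 0$, the function $\gamma$ is increasing with respect to the argument $p$ and decreasing with respect to the arguments $d$ and $h$. By hypothesis we have

$$1.99 \text{ cm} \le d \le 2.01 \text{ cm},$$
$$10.98 \text{ cm} \le h \le 11.02 \text{ cm},$$
$$93.399 \text{ gf} \le p \le 93.401 \text{ gf}$$

Besides,

$$3.14159 < \pi < 3.1416$$

And so

$$\underline{\gamma} = \frac{4 \cdot 93.399}{3.1416 \cdot 2.01^2 \cdot 11.02} = 2.671 \ \frac{\text{gf}}{\text{cm}^3}$$

(too small) and

$$\overline{\gamma} = \frac{4 \cdot 93.401}{3.14159 \cdot 1.99^2 \cdot 10.98} = 2.735 \ \frac{\text{gf}}{\text{cm}^3}$$

(too large). Taking the arithmetic mean, we get

$$\gamma = 2.703 \pm 0.032 \ \frac{\text{gf}}{\text{cm}^3} \tag{5}$$

or, after rounding,

$$\gamma = 2.70 \pm 0.04 \ \frac{\text{gf}}{\text{cm}^3}$$

By way of comparison we give an approximate estimate of the error. Using the mean values of the arguments, we obtain

$$\gamma = \frac{4 \cdot 93.4}{3.1416 \cdot 2^2 \cdot 11} = 2.703 \ \frac{\text{gf}}{\text{cm}^3}$$

Taking logs of (4), we have

$$\ln \gamma = \ln 4 + \ln p - \ln \pi - 2 \ln d - \ln h$$

whence, taking the total differential, we obtain

$$\frac{\Delta \gamma}{\gamma} = \frac{\Delta p}{p} - \frac{\Delta \pi}{\pi} - \frac{2 \Delta d}{d} - \frac{\Delta h}{h}$$

Consequently

$$\delta_\gamma = \delta_p + \delta_\pi + 2\delta_d + \delta_h = \frac{0.001}{93.4} + \frac{0.00001}{3.1416} + 2 \cdot \frac{0.01}{2} + \frac{0.02}{11} =$$
$$= 1.07 \cdot 10^{-5} + 3.18 \cdot 10^{-6} + 10^{-2} + 1.82 \cdot 10^{-3} = 1.183 \cdot 10^{-2}$$

Then we find

$$\Delta_\gamma = \delta_\gamma \cdot \gamma = 1.183 \cdot 10^{-2} \cdot 2.703 = 3.2 \cdot 10^{-2} \ \frac{\text{gf}}{\text{cm}^3}$$

Thus we have, approximately,

$$\gamma = 2.703 \ \frac{\text{gf}}{\text{cm}^3} \pm 0.032 \ \frac{\text{gf}}{\text{cm}^3}$$

which is very close to the exact evaluation (5).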
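The double computation for the cylinder can be reproduced directly. This is a sketch under the data of the example, with $\gamma = 4p / (\pi d^2 h)$; the function name `specific_weight` is introduced here for illustration. The lower bound uses the smallest $p$ and the largest $\pi$, $d$, $h$; the upper bound uses the opposite ends.

```python
def specific_weight(p, pi_approx, d, h):
    """gamma = 4p / (pi * d**2 * h), in gf/cm^3 for p in gf, d and h in cm."""
    return 4.0 * p / (pi_approx * d ** 2 * h)

# gamma increases with p and decreases with pi, d and h.
gamma_low = specific_weight(93.399, 3.1416, 2.01, 11.02)
gamma_high = specific_weight(93.401, 3.14159, 1.99, 10.98)

midpoint = (gamma_low + gamma_high) / 2.0
half_width = (gamma_high - gamma_low) / 2.0
print(gamma_low, gamma_high, midpoint, half_width)
```

In hand computation the lower bound would additionally be rounded down and the upper bound rounded up, so that the enclosure remains valid after rounding.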


1.20 THE NOTION OF A PROBABILITY ERROR ESTIMATE


Suppose we have a sum of $n$ terms:

$$u = x_1 + x_2 + \ldots + x_n$$

Then, as we know, the limiting absolute error of the sum is

$$\Delta_u = \Delta_{x_1} + \Delta_{x_2} + \ldots + \Delta_{x_n} \tag{1}$$

Whence, when the limiting absolute errors of the terms are the same,

$$\Delta_{x_1} = \Delta_{x_2} = \ldots = \Delta_{x_n} = \Delta$$

we have

$$\Delta_u = n\Delta \tag{1'}$$


Formula (1) yields the maximum possible value of the absolute error of the sum. This limiting error is only attained when the errors of all the terms: (1) are the largest possible, and (2) have the same signs. Given a large number of terms, such an unfavourable coincidence has only a remote possibility. Actually, the errors in the separate terms are, as a rule, of opposite sign and consequently partially neutralize one another. For this reason, besides the theoretical limiting error $\Delta_u$ of a sum, we introduce the practical limiting error $\Delta_u^*$, which occurs with a certain measure of certainty.

We confine ourselves to an elementary case. Let the absolute errors $\Delta x_i$ ($i = 1, 2, \ldots, n$) of the terms of the sum be independent and obey the normal law with the same measure of accuracy. Assume that the absolute errors of the terms do not exceed the number $\Delta$ with a probability exceeding the number $\gamma$; that is,

$$P(|\Delta x_i| \le \Delta) > \gamma$$

Given this condition, probability theory proves that the absolute error of the sum $u$ will satisfy the inequality $|\Delta u| \le \Delta \sqrt{n}$, where $n$ is the number of terms, with the same measure of certainty.

Thus, for the limiting absolute error of a sum we can take the number

$$\Delta_u^* = \Delta \sqrt{n} \tag{2}$$

For example, in adding 100 numbers with absolute error 0.1, we get a theoretical limiting error of the sum $\Delta_u$ equal to $0.1 \cdot 100 = 10$. Actually, we may expect that this error does not exceed $0.1 \cdot 10 = 1$.

As a particular instance, consider the arithmetic mean of $n$ numbers:

$$\xi = \frac{1}{n} (x_1 + x_2 + \ldots + x_n)$$

According to strict theory, the limiting absolute error is

$$\Delta_\xi = \frac{1}{n} \cdot n\Delta = \Delta$$

whereas with great certainty we can assert that in actuality

$$\Delta_\xi^* = \frac{1}{n} \Delta \sqrt{n} = \frac{\Delta}{\sqrt{n}}$$

which is to say that we may be practically certain that the arithmetic mean of approximate numbers is more accurate than the numbers themselves, and

$$\Delta_\xi^* \to 0 \text{ as } n \to \infty$$

Similarly, for the case of multiplying $n$ factors with the same limiting relative error $\delta$, we can prove that the practical limiting relative error of the product is given by the formula

$$\delta_u^* = \delta \sqrt{n} \tag{3}$$
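The contrast between the theoretical and practical limiting errors is easy to tabulate. The sketch below only evaluates formulas (1') and (2) and the corresponding expressions for the arithmetic mean; it does not simulate random errors, and the function names are illustrative.

```python
import math

def theoretical_sum_error(delta, n):
    """Formula (1'): all n errors are maximal and of the same sign."""
    return n * delta

def practical_sum_error(delta, n):
    """Formula (2): independent, normally distributed errors."""
    return delta * math.sqrt(n)

def practical_mean_error(delta, n):
    """Practical limiting error of the arithmetic mean of n numbers."""
    return practical_sum_error(delta, n) / n

for n in (1, 100, 10_000):
    print(n, theoretical_sum_error(0.1, n),
          practical_sum_error(0.1, n), practical_mean_error(0.1, n))
```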


REFERENCES FOR CHAPTER 1 


[1] A. N. Krylov, Lectures on Approximate Computations, 1933, Chapter I (in Russian).

[2] D. A. Ventsel, E. S. Ventsel, Elements of the Theory of Approximate Computations, 1949, Chapter 1 (in Russian).

[3] James B. Scarborough, Numerical Mathematical Analysis, 1955, Chapter I.

[4] Ya. S. Bezikovich, Approximate Computations, 1949, Chapters I and II (in Russian).

[5] G. M. Fikhtengolts, Mathematics for Engineers, 1933, Part 1, Chapter I (in Russian).

[6] V. M. Bradis, Oral and Written Computing. Computational Aids, Encyclopedia of Elementary Mathematics, Book 1, 1951 (in Russian).


Chapter 2 


SOME FACTS FROM THE THEORY 
OF CONTINUED FRACTIONS 


2.1 THE DEFINITION OF A CONTINUED FRACTION 


An expression of the form

$$a_0 + \cfrac{b_1}{a_1 + \cfrac{b_2}{a_2 + \cfrac{b_3}{a_3 + \ldots}}} \tag{1}$$

is called a continued fraction. The following abbreviated notation is also used for the continued fraction (1):

$$\left[a_0; \frac{b_1}{a_1}, \frac{b_2}{a_2}, \frac{b_3}{a_3}, \ldots\right]$$

In the general case, the elements of a continued fraction $a_0$, $a_k$, $b_k$ ($k = 1, 2, \ldots$) are real or complex numbers, or functions of one or more variables. The fractions $a_0 = \frac{a_0}{1}$, $\frac{b_k}{a_k}$ ($k = 1, 2, \ldots$) are called components of the continued fraction (the zeroth, the first, etc.), and the numbers or functions $a_k$ and $b_k$ ($k \ge 1$) are called the terms of the $k$th component (partial denominators or numerators). We will assume that $a_k \ne 0$. Note that in the abbreviated notation of (1) the components $\frac{b_k}{a_k}$ cannot be reduced.

If the continued fraction (1) contains a finite number of components (say $n$, not counting the zeroth one), it is called a finite or $n$-component continued fraction and is symbolized compactly as

$$\left[a_0; \frac{b_1}{a_1}, \frac{b_2}{a_2}, \ldots, \frac{b_n}{a_n}\right] = \left[a_0; \frac{b_k}{a_k}\right]_1^n \tag{2}$$

A finite continued fraction is identified with the corresponding common fraction obtained by performing the indicated operations.

A continued fraction (1) having an infinity of components is termed an infinite continued fraction and is denoted as

$$\left[a_0; \frac{b_k}{a_k}\right]_1^\infty \tag{3}$$

The continued fraction

$$a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \ldots}} = [a_0; a_1, a_2, \ldots] \tag{4}$$

where all partial numerators are equal to 1 is termed a simple (or standard) continued fraction. The denominators of the components are called partial quotients. Note that in the theory of numbers, partial quotients are usually natural numbers (positive integers).


2.2 CONVERTING A CONTINUED FRACTION
TO A SIMPLE FRACTION AND VICE VERSA 


Any finite continued fraction may be converted to a simple 
fraction. To do this, simply perform all the operations indicated 
by the continued fraction. 


Example 1. Change the continued fraction 
l l 
35. P t| =3+— 
| 3? T?! 4 34 
pe 
to a simple fraction.
Solution. Performing the indicated operations in succession, we get


Q)l4p52, 41: Psk, 
(0) 1: $=%, omits z 
(3) 3+4=F, 


Hence 


Conversely, any positive rational number may be converted to a continued fraction with natural elements. Suppose, for example, we are given the fraction $\frac{p}{q}$. Eliminating the integral part $a_0$, we have

$$\frac{p}{q} = a_0 + \frac{r_0}{q}$$

where $r_0$ is the remainder (if $\frac{p}{q}$ is a proper fraction then $a_0 = 0$ and $r_0 = p$).

Dividing the numerator and denominator of the fraction $\frac{r_0}{q}$ by $r_0$, we have

$$\frac{r_0}{q} = \frac{1}{\dfrac{q}{r_0}} = \frac{1}{a_1 + \dfrac{r_1}{r_0}}$$

where $a_1$ is an integral quotient and $r_1$ is the remainder left from dividing $q$ by $r_0$.

Dividing the numerator and the denominator of the fraction $\frac{r_1}{r_0}$ by $r_1$, we obtain

$$\frac{r_1}{r_0} = \frac{1}{\dfrac{r_0}{r_1}} = \frac{1}{a_2 + \dfrac{r_2}{r_1}}$$

where $a_2$ is an integral quotient and $r_2$ is the remainder left from dividing $r_0$ by $r_1$. The process may be continued in similar fashion. Since $q > r_0 > r_1 > r_2 > \ldots$ and $r_i$ ($i = 0, 1, 2, \ldots$) are positive integers, we will finally have $r_n = 0$, or

$$\frac{r_{n-1}}{r_{n-2}} = \frac{1}{a_n + 0}$$

Substituting the expressions of the fractions $\frac{r_i}{r_{i-1}}$, we get

$$\frac{p}{q} = a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \ldots + \cfrac{1}{a_n}}}$$
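The successive divisions described above are the Euclidean algorithm, so the conversion can be sketched in a few lines. The function names and the test value 415/93 are illustrative choices, not taken from the text.

```python
from fractions import Fraction

def to_continued_fraction(p: int, q: int) -> list:
    """Partial quotients [a0, a1, ..., an] of the positive rational p/q."""
    quotients = []
    while q != 0:
        a, r = divmod(p, q)     # integral quotient and remainder
        quotients.append(a)
        p, q = q, r             # continue with q / r
    return quotients

def from_continued_fraction(quotients) -> Fraction:
    """Evaluate a simple continued fraction from the last quotient upward."""
    value = Fraction(quotients[-1])
    for a in reversed(quotients[:-1]):
        value = a + 1 / value
    return value

cf = to_continued_fraction(415, 93)
print(cf, from_continued_fraction(cf))
```

For a proper fraction the first quotient is zero, in agreement with the remark that $a_0 = 0$ in that case.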


62 . x P F 
Example 2. Convert T into a continued fraction. 


Solution. We have, successively, 


E E E E a 234+ 334+ 
5 a RSS REE 
Ea 1+ 
4 4 
62 i 1 1 
Thus, jo Ogi ee F 
General continued fractions are transformed analogously.




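Example 3. Convert the continued fraction

$$\left[1; \frac{-x^2}{1}, \frac{-x^2}{3}, \frac{-x^2}{5}\right] = 1 - \cfrac{x^2}{1 - \cfrac{x^2}{3 - \cfrac{x^2}{5}}}$$

to a simple one.

Solution. We have

$$(1) \quad 1 - \frac{x^2}{3 - \dfrac{x^2}{5}} = 1 - \frac{5x^2}{15 - x^2} = \frac{15 - 6x^2}{15 - x^2}$$

$$(2) \quad 1 - \frac{x^2}{\dfrac{15 - 6x^2}{15 - x^2}} = 1 - \frac{15x^2 - x^4}{15 - 6x^2} = \frac{15 - 21x^2 + x^4}{15 - 6x^2}$$

and so

$$\left[1; \frac{-x^2}{1}, \frac{-x^2}{3}, \frac{-x^2}{5}\right] = \frac{15 - 21x^2 + x^4}{15 - 6x^2}$$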


2.3 CONVERGENTS 


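Suppose we have a terminating or nonterminating continued fraction

$$\left[a_0; \frac{b_k}{a_k}\right]_1^n \tag{1}$$

The simple fraction

$$\frac{P_k}{Q_k} = \left[a_0; \frac{b_1}{a_1}, \frac{b_2}{a_2}, \ldots, \frac{b_k}{a_k}\right]$$

($k = 1, 2, \ldots$), where $k \le n$, is called the $k$th convergent of the continued fraction (1). Following Euler, we usually set

$$\frac{P_0}{Q_0} = \frac{a_0}{1}, \quad \frac{P_{-1}}{Q_{-1}} = \frac{1}{0}$$

and for definiteness we assume that

$$P_0 = a_0, \quad Q_0 = 1 \tag{2}$$

and

$$P_{-1} = 1, \quad Q_{-1} = 0 \tag{2'}$$

When using a digital computer, the convergents

$$\frac{P_k}{Q_k} = a_0 + \cfrac{b_1}{a_1 + \cfrac{b_2}{a_2 + \ldots + \cfrac{b_k}{a_k}}}$$

are conveniently found with the aid of Horner's scheme (see Chapter 3) for division:

$$c_k = \frac{b_k}{a_k}, \quad d_{k-1} = a_{k-1} + c_k;$$
$$c_{k-1} = \frac{b_{k-1}}{d_{k-1}}, \quad d_{k-2} = a_{k-2} + c_{k-1};$$
$$\ldots$$
$$c_1 = \frac{b_1}{d_1}, \quad d_0 = a_0 + c_1 = \frac{P_k}{Q_k}$$

The indicated sequence of operations can readily be programmed.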


Theorem 1 (Law of formation of convergents). The numbers $P_k$, $Q_k$ ($k = -1, 0, 1, 2, \ldots$), determined from the relations

$$P_k = a_k P_{k-1} + b_k P_{k-2} \tag{3}$$
$$Q_k = a_k Q_{k-1} + b_k Q_{k-2} \tag{3'}$$

where

$$P_{-1} = 1, \quad Q_{-1} = 0, \quad P_0 = a_0, \quad Q_0 = 1 \tag{4}$$

are, respectively, the numerators and denominators of the convergents $\frac{P_k}{Q_k}$ of the continued fraction (1). We shall call such convergents canonical.

Proof. Let $R_k$ ($k = 1, 2, \ldots$) be the successive convergents of the continued fraction (1). It is required to prove that

$$R_k = \frac{P_k}{Q_k} \quad (k = 1, 2, \ldots)$$

We carry out the proof by the method of mathematical induction. When $k = 1$ we have, for the convergent $R_1$,

$$R_1 = a_0 + \frac{b_1}{a_1} = \frac{a_0 a_1 + b_1}{a_1}$$

On the other hand, from relations (3) and (3'), we get, taking into consideration (4),

$$P_1 = a_1 a_0 + b_1,$$
$$Q_1 = a_1 \cdot 1 + b_1 \cdot 0 = a_1$$

Hence, $R_1 = \frac{P_1}{Q_1}$ and for $k = 1$ the assertion of the theorem holds.

Now let the theorem be true for all natural numbers not exceeding $k$. We will show that the theorem also holds true for the natural number $k + 1$. From (3) and (3') we obtain

$$P_{k+1} = a_{k+1} P_k + b_{k+1} P_{k-1},$$
$$Q_{k+1} = a_{k+1} Q_k + b_{k+1} Q_{k-1}$$

By the induction hypothesis we have

$$R_k = \frac{P_k}{Q_k} = \frac{a_k P_{k-1} + b_k P_{k-2}}{a_k Q_{k-1} + b_k Q_{k-2}}$$

By the law of formation of continued fraction (1), the convergent $R_{k+1}$ is obtained from the convergent $R_k$ by replacing the term $a_k$ by the sum $a_k + \frac{b_{k+1}}{a_{k+1}}$. Therefore

$$R_{k+1} = \frac{\left(a_k + \dfrac{b_{k+1}}{a_{k+1}}\right) P_{k-1} + b_k P_{k-2}}{\left(a_k + \dfrac{b_{k+1}}{a_{k+1}}\right) Q_{k-1} + b_k Q_{k-2}} = \frac{a_{k+1} (a_k P_{k-1} + b_k P_{k-2}) + b_{k+1} P_{k-1}}{a_{k+1} (a_k Q_{k-1} + b_k Q_{k-2}) + b_{k+1} Q_{k-1}} = \frac{a_{k+1} P_k + b_{k+1} P_{k-1}}{a_{k+1} Q_k + b_{k+1} Q_{k-1}} = \frac{P_{k+1}}{Q_{k+1}}$$

which completes the proof.


Note. Since the terms of the convergents are not defined uniquely (a fraction is unchanged when its numerator and denominator are multiplied by the same nonzero number), one cannot, in the general case, assert that the numerators and denominators of convergents of noncanonical type satisfy the equations (3) and (3'). In the sequel we assume that the convergents under consideration are canonical.


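Corollary. For the simple continued fraction

$$a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cdots}}$$

the numerators and denominators of its convergents $p_k/q_k$ (k = 1, 2, ...) can be determined from the relations

$$p_k = a_k p_{k-1} + p_{k-2}, \qquad q_k = a_k q_{k-1} + q_{k-2} \tag{3''}$$

where we put $p_0 = a_0$, $p_{-1} = 1$, and $q_0 = 1$, $q_{-1} = 0$.

Note. The following scheme is convenient for finding the terms of successive convergents from formulas (3) and (3').

    k    | -1 |  0  |  1  |  2  | ... |  n
    b_k  |    |     | b_1 | b_2 | ... | b_n
    a_k  |    | a_0 | a_1 | a_2 | ... | a_n
    P_k  |  1 | a_0 | P_1 | P_2 | ... | P_n
    Q_k  |  0 |  1  | Q_1 | Q_2 | ... | Q_n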





In this scheme, the row $b_k$ is omitted for a simple continued fraction, where $b_k = 1$ (k = 1, 2, ...) and the formulas (3'') hold.


Example 1. Compute all the convergents of the continued fraction 


1+ 
3+ 


l 
1 


4f— 
“Te 


Solution. Using the scheme, we obtain 








Thus 


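Example 2. Find all the convergents of the general continued fraction

$$\left[0;\ \frac{1}{2},\ \frac{3}{4},\ \frac{5}{8},\ \frac{7}{16}\right]$$

Solution. Using the foregoing scheme, we have

    k    | -1 |  0 |  1 |  2 |  3 |    4
    b_k  |    |    |  1 |  3 |  5 |    7
    a_k  |    |  0 |  2 |  4 |  8 |   16
    P_k  |  1 |  0 |  1 |  4 | 37 |  620
    Q_k  |  0 |  1 |  2 | 11 | 98 | 1645

Whence

$$\frac{P_0}{Q_0} = \frac{0}{1}, \quad \frac{P_1}{Q_1} = \frac{1}{2}, \quad \frac{P_2}{Q_2} = \frac{4}{11}, \quad \frac{P_3}{Q_3} = \frac{37}{98}, \quad \frac{P_4}{Q_4} = \frac{620}{1645}$$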
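The law of formation (3), (3') translates directly into a short forward loop. The following Python sketch is one possible rendering; the function name `convergents` and the argument layout are assumptions of this example, and the coefficients in the usage line are chosen only for illustration.

```python
def convergents(a, b):
    """Return [(P_0, Q_0), ..., (P_k, Q_k)] using recurrences (3) and (3').

    a = [a_0, a_1, ..., a_k], b = [b_1, ..., b_k] (hypothetical layout).
    The pairs are canonical: they are not reduced to lowest terms.
    """
    p_prev, q_prev = 1, 0          # P_{-1}, Q_{-1}
    p, q = a[0], 1                 # P_0,  Q_0
    result = [(p, q)]
    for a_k, b_k in zip(a[1:], b):
        p, p_prev = a_k * p + b_k * p_prev, p
        q, q_prev = a_k * q + b_k * q_prev, q
        result.append((p, q))
    return result


if __name__ == "__main__":
    for k, (p, q) in enumerate(convergents([0, 2, 4, 8, 16], [1, 3, 5, 7])):
        print(k, p, q)
```

Keeping numerators and denominators as separate integers, rather than reducing each fraction, preserves the canonical form that the later theorems of this section rely on.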
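Theorem 2. The formula

$$\frac{P_k}{Q_k} - \frac{P_{k-1}}{Q_{k-1}} = \frac{(-1)^{k-1}\, b_1 b_2 \ldots b_k}{Q_{k-1} Q_k} \qquad (k \ge 1) \tag{4'}$$

holds true for two successive convergents $P_{k-1}/Q_{k-1}$ and $P_k/Q_k$ of the continued fraction (1).

Proof. We have

$$\frac{P_k}{Q_k} - \frac{P_{k-1}}{Q_{k-1}} = \frac{\Delta_k}{Q_{k-1} Q_k} \tag{5}$$

where

$$\Delta_k = \begin{vmatrix} P_k & P_{k-1} \\ Q_k & Q_{k-1} \end{vmatrix}$$

Using relations (3) and (3'), we get (by virtue of familiar determinantal properties)

$$\Delta_k = \begin{vmatrix} a_k P_{k-1} + b_k P_{k-2} & P_{k-1} \\ a_k Q_{k-1} + b_k Q_{k-2} & Q_{k-1} \end{vmatrix} = b_k \begin{vmatrix} P_{k-2} & P_{k-1} \\ Q_{k-2} & Q_{k-1} \end{vmatrix} = -b_k \Delta_{k-1}$$

From this we successively obtain

$$\Delta_k = (-b_k)(-b_{k-1}) \ldots (-b_1)\, \Delta_0 = (-1)^k\, b_1 b_2 \ldots b_k\, \Delta_0$$

where

$$\Delta_0 = \begin{vmatrix} P_0 & P_{-1} \\ Q_0 & Q_{-1} \end{vmatrix} = \begin{vmatrix} a_0 & 1 \\ 1 & 0 \end{vmatrix} = -1$$

Thus

$$\Delta_k = (-1)^{k-1}\, b_1 b_2 \ldots b_k$$

and, consequently, on the basis of (5) we conclude that

$$\frac{P_k}{Q_k} - \frac{P_{k-1}}{Q_{k-1}} = \frac{(-1)^{k-1}\, b_1 b_2 \ldots b_k}{Q_{k-1} Q_k}$$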





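Corollary 1. If $P_{k-1}/Q_{k-1}$ and $P_k/Q_k$ ($k \ge 1$) are two successive convergents of the continued fraction (1), then

$$\Delta_k = P_k Q_{k-1} - P_{k-1} Q_k = (-1)^{k-1}\, b_1 b_2 \ldots b_k$$

Corollary 2. For two successive convergents $p_{k-1}/q_{k-1}$ and $p_k/q_k$ ($k \ge 1$) of a simple continued fraction, the following equation holds true:

$$\frac{p_k}{q_k} - \frac{p_{k-1}}{q_{k-1}} = \frac{(-1)^{k-1}}{q_{k-1} q_k} \tag{4''}$$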


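Theorem 3. For two successive convergents of equal parity $P_{k-2}/Q_{k-2}$ and $P_k/Q_k$ ($k \ge 2$) of the continued fraction (1), the relation

$$\frac{P_k}{Q_k} - \frac{P_{k-2}}{Q_{k-2}} = \frac{(-1)^k\, b_1 b_2 \ldots b_{k-1}\, a_k}{Q_{k-2} Q_k} \tag{6}$$

holds true.

Proof. We have

$$\frac{P_k}{Q_k} - \frac{P_{k-2}}{Q_{k-2}} = \frac{D_k}{Q_{k-2} Q_k} \tag{7}$$

where

$$D_k = \begin{vmatrix} P_k & P_{k-2} \\ Q_k & Q_{k-2} \end{vmatrix}$$

whence, on the basis of the law of formation of convergents and on the basis of elementary properties of determinants, we obtain

$$D_k = \begin{vmatrix} a_k P_{k-1} + b_k P_{k-2} & P_{k-2} \\ a_k Q_{k-1} + b_k Q_{k-2} & Q_{k-2} \end{vmatrix} = a_k \begin{vmatrix} P_{k-1} & P_{k-2} \\ Q_{k-1} & Q_{k-2} \end{vmatrix} = a_k \Delta_{k-1}$$

where $\Delta_{k-1}$ is the determinant considered in Theorem 2. By Corollary 1 to Theorem 2, we have

$$\Delta_{k-1} = (-1)^{k-2}\, b_1 b_2 \ldots b_{k-1} = (-1)^k\, b_1 b_2 \ldots b_{k-1}$$

whence

$$D_k = (-1)^k\, b_1 b_2 \ldots b_{k-1}\, a_k$$

Consequently we obtain formula (6) by using relation (7).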
Corollary. If $p_{k-2}/q_{k-2}$ and $p_k/q_k$ are two successive convergents of the same parity of the simple continued fraction

$$a_0 + \cfrac{1}{a_1 + \cfrac{1}{a_2 + \cdots}}$$

then we have the relation

$$\frac{p_k}{q_k} - \frac{p_{k-2}}{q_{k-2}} = \frac{(-1)^k\, a_k}{q_{k-2} q_k} \tag{6'}$$
Theorem 4. If all the elements of a finite continued fraction are positive, then its convergents of even order form a monotonic increasing sequence and the convergents of odd order form a monotonic decreasing sequence. Each convergent of even order is less than any convergent of odd order. The number α itself, which is expressed by the continued fraction, lies between two successive convergents.


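Proof. Suppose we have the continued fraction

$$\alpha = \left[a_0;\ \frac{b_1}{a_1},\ \frac{b_2}{a_2},\ \ldots,\ \frac{b_n}{a_n}\right] \tag{8}$$

with positive elements $a_k$ and $b_k$ and suppose $P_k/Q_k$ (k = 0, 1, ..., n) are its successive canonical convergents. Then obviously $P_k > 0$ and $Q_k > 0$.

We consider two cases.

1. Let k = 2m be an even number. Then from (6), taking into consideration that $a_i > 0$ and $b_i > 0$ (i = 1, ..., k), we have

$$\frac{P_{2m}}{Q_{2m}} - \frac{P_{2m-2}}{Q_{2m-2}} > 0$$

Consequently

$$\frac{P_{2m-2}}{Q_{2m-2}} < \frac{P_{2m}}{Q_{2m}} \qquad (m = 1, 2, \ldots)$$

or

$$\frac{P_0}{Q_0} < \frac{P_2}{Q_2} < \frac{P_4}{Q_4} < \ldots \tag{9}$$

2. Let k = 2m + 1 be odd. Then k - 1 will be even, and from the same relation (6) we get

$$\frac{P_{2m-1}}{Q_{2m-1}} > \frac{P_{2m+1}}{Q_{2m+1}}$$

or

$$\frac{P_1}{Q_1} > \frac{P_3}{Q_3} > \frac{P_5}{Q_5} > \ldots \tag{10}$$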


We have thus proved that even convergents form a monotonic increasing sequence and odd convergents form a monotonic decreasing sequence (Fig. 1).

[Fig. 1]


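Furthermore, if in (4') we set k = 2m, we get

$$\frac{P_{2m}}{Q_{2m}} - \frac{P_{2m-1}}{Q_{2m-1}} < 0, \quad \text{that is,} \quad \frac{P_{2m-1}}{Q_{2m-1}} > \frac{P_{2m}}{Q_{2m}} \tag{11}$$

which is to say that every convergent of odd order is greater than the adjacent convergent of even order. We conclude therefrom that any convergent of odd order is greater than any convergent of even order. Indeed, let $P_{2s-1}/Q_{2s-1}$ be any odd convergent. If $s \le m$, then

$$\frac{P_{2s-1}}{Q_{2s-1}} \ge \frac{P_{2m-1}}{Q_{2m-1}} > \frac{P_{2m}}{Q_{2m}}$$

but if s > m, then

$$\frac{P_{2s-1}}{Q_{2s-1}} > \frac{P_{2s}}{Q_{2s}} > \frac{P_{2m}}{Q_{2m}}$$

And so for any s and m we have

$$\frac{P_{2s-1}}{Q_{2s-1}} > \frac{P_{2m}}{Q_{2m}}$$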
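Finally, from the law of formation of a continued fraction

$$\alpha = a_0 + \cfrac{b_1}{a_1 + \cfrac{b_2}{a_2 + \cdots}} \tag{12}$$

we have the obvious relations

$$\alpha > \frac{P_0}{Q_0}, \quad \alpha < \frac{P_1}{Q_1}, \quad \alpha > \frac{P_2}{Q_2}, \quad \ldots$$

Hence

$$\frac{P_k}{Q_k} < \alpha < \frac{P_{k+1}}{Q_{k+1}} \tag{13}$$

if k is even and

$$\frac{P_k}{Q_k} > \alpha > \frac{P_{k+1}}{Q_{k+1}} \tag{13'}$$

if k is odd. For the last convergent, we will clearly have an equality on the right in place of the strict inequalities (13) and (13').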




Corollary 1. If the elements of the continued fraction (8) are positive and $P_k/Q_k$ are its convergents, then the following estimate holds true:

$$\left|\alpha - \frac{P_k}{Q_k}\right| \le \frac{b_1 b_2 \ldots b_{k+1}}{Q_k Q_{k+1}} \tag{14}$$

Indeed, since, by what has been proved, we have

$$\left|\alpha - \frac{P_k}{Q_k}\right| \le \left|\frac{P_{k+1}}{Q_{k+1}} - \frac{P_k}{Q_k}\right|$$

it follows, on the basis of (4'), that (14) holds true.


Corollary 2. If the continued fraction α with positive elements is simple and $p_k/q_k$ are its successive convergents, then

$$\left|\alpha - \frac{p_k}{q_k}\right| \le \frac{1}{q_k q_{k+1}}$$





Note. If all elements of a simple continued fraction are natural, it may be demonstrated [1] that the convergent $p_k/q_k$ is the best approximation to the number α, that is, that all the remaining fractions $p/q$ with denominator $q \le q_k$ deviate from α more than the fraction $p_k/q_k$.

a . 58 


Exampie 3. The second last convergent of zp is ay (see Examp- 
le 1). Therefore 
sas <S a © 9.001 
2.4 NONTERMINATING CONTINUED FRACTIONS 
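Let

$$\left[a_0;\ \frac{b_1}{a_1},\ \frac{b_2}{a_2},\ \ldots,\ \frac{b_n}{a_n},\ \ldots\right] \tag{1}$$

be a nonterminating continued fraction. We consider a segment, that is, a terminating continued fraction:

$$\frac{P_n}{Q_n} = \left[a_0;\ \frac{b_1}{a_1},\ \frac{b_2}{a_2},\ \ldots,\ \frac{b_n}{a_n}\right] \qquad (n = 1, 2, 3, \ldots) \tag{2}$$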


Definition. A nonterminating continued fraction (1) is called convergent if there exists a finite limit

$$\alpha = \lim_{n \to \infty} \frac{P_n}{Q_n} \tag{3}$$

and the number α is taken as the value of the fraction. But if no limit (3) exists, then the continued fraction (1) is termed divergent and no numerical value is assigned to it.


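By Cauchy's test [3], for the convergence of a sequence $P_n/Q_n$ (n = 1, 2, 3, ...) it is necessary and sufficient that there exist for every ε > 0 a number N = N(ε) such that

$$\left|\frac{P_{n+m}}{Q_{n+m}} - \frac{P_n}{Q_n}\right| < \varepsilon$$

for n > N and for any m > 0.

If $Q_k \ne 0$, then we obviously have

$$\frac{P_n}{Q_n} = \frac{P_0}{Q_0} + \sum_{k=1}^{n} \left(\frac{P_k}{Q_k} - \frac{P_{k-1}}{Q_{k-1}}\right) \tag{4}$$

From this, using Theorem 2 of Sec. 2.3,

$$\lim_{n \to \infty} \frac{P_n}{Q_n} = \frac{P_0}{Q_0} + \sum_{k=1}^{\infty} \left(\frac{P_k}{Q_k} - \frac{P_{k-1}}{Q_{k-1}}\right) = a_0 + \sum_{k=1}^{\infty} \frac{(-1)^{k-1}\, b_1 b_2 \ldots b_k}{Q_{k-1} Q_k} \tag{4'}$$

That is, the convergence of continued fraction (1) is equivalent to the convergence of the series (4'). If the continued fraction (1) converges:

$$\alpha = \lim_{n \to \infty} \frac{P_n}{Q_n}$$

then, by virtue of formulas (4) and (4'), we have

$$\left|\alpha - \frac{P_n}{Q_n}\right| = \left|\sum_{k=n+1}^{\infty} \left(\frac{P_k}{Q_k} - \frac{P_{k-1}}{Q_{k-1}}\right)\right| = \left|\sum_{k=n+1}^{\infty} \frac{(-1)^{k-1}\, b_1 b_2 \ldots b_k}{Q_{k-1} Q_k}\right|$$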








Theorem 1. If all the elements $a_k$, $b_k$ (k = 0, 1, ...) of continued fraction (1) are positive, and

$$b_k \le a_k \quad \text{and} \quad a_k \ge d > 0 \qquad (k = 1, 2, \ldots) \tag{5}$$

then (1) converges.


Proof. When proving the first part of Theorem 4 of the preceding section we did not make use of the property of finiteness of a continued fraction. Therefore, repeating this proof, we establish the fact that if the elements of continued fraction (1) are positive, then its even convergents $P_{2k}/Q_{2k}$ (k = 0, 1, 2, ...) form a monotonic increasing sequence bounded from above (for example, by the number $P_1/Q_1$). Whence, by virtue of a familiar theorem, we conclude that the limit

$$\lim_{k \to \infty} \frac{P_{2k}}{Q_{2k}} = \alpha$$

exists. Similarly, under the conditions of our theorem odd convergents $P_{2k+1}/Q_{2k+1}$ (k = 0, 1, 2, ...) of the continued fraction (1) form a monotonic decreasing sequence bounded from below (by the number $P_0/Q_0$, for example). Hence there also exists

$$\lim_{k \to \infty} \frac{P_{2k+1}}{Q_{2k+1}} = \beta$$

and $\beta \ge \alpha$. Besides, for any $k \ge 0$ we have

$$\frac{P_{2k}}{Q_{2k}} < \alpha \le \beta < \frac{P_{2k+1}}{Q_{2k+1}}$$

and so, using Theorem 2 of Sec. 2.3, we get

$$0 \le \beta - \alpha < \frac{P_{2k+1}}{Q_{2k+1}} - \frac{P_{2k}}{Q_{2k}} = \frac{b_1 b_2 \ldots b_{2k+1}}{Q_{2k} Q_{2k+1}} = \eta_k \tag{6}$$


We shall show that $\eta_k \to 0$ as $k \to \infty$. Indeed, on the basis of the law of formation of convergents, we have, for $k \ge 2$,

$$Q_k = a_k Q_{k-1} + b_k Q_{k-2}$$

and

$$Q_{k-1} = a_{k-1} Q_{k-2} + b_{k-1} Q_{k-3}$$

Whence, by virtue of condition (5) of the theorem, we conclude that

$$Q_k \ge b_k (Q_{k-1} + Q_{k-2})$$

and

$$Q_{k-1} \ge d\, Q_{k-2}$$

Hence

$$Q_k \ge b_k (1 + d)\, Q_{k-2} \tag{7}$$

From the inequality (7) we successively obtain

$$Q_{2k} \ge b_{2k} (1 + d)\, Q_{2k-2} \ge \ldots \ge b_{2k} b_{2k-2} \ldots b_2\, (1 + d)^k Q_0 = b_2 b_4 \ldots b_{2k}\, (1 + d)^k \tag{8}$$

and

$$Q_{2k+1} \ge b_{2k+1} (1 + d)\, Q_{2k-1} \ge \ldots \ge b_{2k+1} b_{2k-1} \ldots b_3\, (1 + d)^k Q_1 \ge b_1 b_3 \ldots b_{2k+1}\, (1 + d)^k \tag{9}$$

since $Q_1 = a_1 \ge b_1$. Multiplying together inequalities (8) and (9), we get

$$Q_{2k} Q_{2k+1} \ge b_1 b_2 \ldots b_{2k+1}\, (1 + d)^{2k} \tag{10}$$

and hence,

$$\eta_k = \frac{b_1 b_2 \ldots b_{2k+1}}{Q_{2k} Q_{2k+1}} \le \frac{1}{(1 + d)^{2k}}$$

Thus, $\eta_k \to 0$ as $k \to \infty$.

And so, passing to the limit as $k \to \infty$ in inequality (6), we have $0 \le \beta - \alpha \le 0$, or

$$\alpha = \beta = \lim_{n \to \infty} \frac{P_n}{Q_n}$$

and therefore the continued fraction (1) converges.
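Note. The value α of the converging fraction (1) with positive elements lies between two successive convergents $P_{n-1}/Q_{n-1}$ and $P_n/Q_n$ ($n \ge 1$). Hence

$$\left|\alpha - \frac{P_n}{Q_n}\right| < \left|\frac{P_n}{Q_n} - \frac{P_{n-1}}{Q_{n-1}}\right| = \frac{b_1 b_2 \ldots b_n}{Q_{n-1} Q_n}$$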


Corollary. A simple continued fraction with natural elements always converges.

The following theorem can also be proved [1].

Theorem 2. Every positive number α may be expanded in unique fashion into a simple converging continued fraction with natural elements. The resulting continued fraction is terminating if α is a rational number, and nonterminating if α is irrational.


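Example. Expand the number $\sqrt{41}$ into a continued fraction and find its approximate value.

Solution. Since the largest integer in $\sqrt{41}$ is 6, we have

$$\sqrt{41} = 6 + \frac{1}{\alpha_1} \tag{11}$$

whence

$$\alpha_1 = \frac{1}{\sqrt{41} - 6} = \frac{6 + \sqrt{41}}{5}$$

The largest integer in $\alpha_1$ is 2, and so

$$\alpha_1 = 2 + \frac{1}{\alpha_2} \tag{12}$$

whence

$$\alpha_2 = \frac{1}{\alpha_1 - 2} = \frac{5}{\sqrt{41} - 4} = \frac{4 + \sqrt{41}}{5} \tag{13}$$

Similarly

$$\alpha_2 = 2 + \frac{1}{\alpha_3}, \qquad \alpha_3 = \frac{1}{\alpha_2 - 2} = \frac{5}{\sqrt{41} - 6} = 6 + \sqrt{41} \tag{14}$$

$$\alpha_3 = 12 + \frac{1}{\alpha_4}, \qquad \alpha_4 = \frac{1}{\alpha_3 - 12} = \frac{1}{\sqrt{41} - 6} = \frac{6 + \sqrt{41}}{5} \tag{15}$$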


We note that $\alpha_4 = \alpha_1$ and so the elements of the continued fraction will repeat, that is, the continued fraction will be recurring. Substituting successively expressions (12), (13), (14), (15), etc., into (11), we get

$$\sqrt{41} = 6 + \cfrac{1}{2 + \cfrac{1}{2 + \cfrac{1}{12 + \cfrac{1}{2 + \cfrac{1}{2 + \cfrac{1}{12 + \cdots}}}}}}$$

Thus, the irrational number $\sqrt{41}$ is expressed as a nonterminating periodic continued fraction:

$$\sqrt{41} = \left[6;\ \frac{1}{2},\ \frac{1}{2},\ \frac{1}{12},\ \frac{1}{2},\ \frac{1}{2},\ \frac{1}{12},\ \ldots\right]$$


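The convergents $p_k/q_k$ (k = 0, 1, 2, ...) are found by means of the following scheme:

    k    | -1 | 0 |  1 |  2 |   3 |   4 |    5
    a_k  |    | 6 |  2 |  2 |  12 |   2 |    2
    p_k  |  1 | 6 | 13 | 32 | 397 | 826 | 2049
    q_k  |  0 | 1 |  2 |  5 |  62 | 129 |  320

Confining ourselves, say, to the fifth convergent, we obtain a major (too large) approximation of $\sqrt{41}$:

$$\sqrt{41} \approx \frac{2049}{320} = 6.403125$$

with absolute error less than

$$\Delta = \frac{1}{q_5 q_6} = \frac{1}{320 \cdot 3969} < 8 \cdot 10^{-7}$$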
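The expansion of a square root can also be carried out mechanically in integer arithmetic. The sketch below uses the standard recurrence for the complete quotients $(m + \sqrt{n})/d$, which is a substitute for the hand manipulation in the text rather than the book's own procedure; the function names are assumptions of this example.

```python
from fractions import Fraction
from math import isqrt, sqrt


def sqrt_cf_terms(n, count):
    """First `count` elements a_0, a_1, ... of the simple continued fraction of sqrt(n)."""
    a0 = isqrt(n)
    if a0 * a0 == n:
        raise ValueError("n must not be a perfect square")
    terms = [a0]
    m, d, a = 0, 1, a0             # complete quotient is (m + sqrt(n)) / d
    while len(terms) < count:
        m = d * a - m
        d = (n - m * m) // d       # this division is always exact
        a = (a0 + m) // d          # integer part of the next complete quotient
        terms.append(a)
    return terms


def simple_convergents(terms):
    """Convergents p_k/q_k of a simple continued fraction, by relations (3'')."""
    p_prev, q_prev = 1, 0
    p, q = terms[0], 1
    out = [Fraction(p, q)]
    for a in terms[1:]:
        p, p_prev = a * p + p_prev, p
        q, q_prev = a * q + q_prev, q
        out.append(Fraction(p, q))
    return out


if __name__ == "__main__":
    terms = sqrt_cf_terms(41, 7)
    fifth = simple_convergents(terms[:6])[-1]
    print(terms, fifth, float(fifth) - sqrt(41))
```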


Theorem 3 (Pringsheim). If for the nonterminating continued fraction

$$\left[0;\ \frac{b_1}{a_1},\ \frac{b_2}{a_2},\ \ldots,\ \frac{b_n}{a_n},\ \ldots\right] \tag{16}$$

the inequalities

$$|b_k| + 1 \le |a_k| \qquad (k = 1, 2, \ldots) \tag{17}$$

hold true, then this fraction converges, and its absolute value does not exceed unity [4].


Proof. Let $P_k/Q_k$ (k = 1, 2, ...) be convergents of the continued fraction (16). Since

$$Q_k = a_k Q_{k-1} + b_k Q_{k-2} \qquad (k = 1, 2, \ldots)$$

it follows that

$$|Q_k| \ge |a_k|\,|Q_{k-1}| - |b_k|\,|Q_{k-2}|$$

Whence, using inequality (17), we get

$$|Q_k| \ge (|b_k| + 1)\,|Q_{k-1}| - |b_k|\,|Q_{k-2}|$$

or

$$|Q_k| - |Q_{k-1}| \ge |b_k|\,(|Q_{k-1}| - |Q_{k-2}|) \tag{18}$$

Applying inequality (18) successively and noting that $Q_0 = 1$ and $Q_{-1} = 0$, we have

$$|Q_k| - |Q_{k-1}| \ge |b_k|\,|b_{k-1}| \ldots |b_1| \tag{19}$$

From inequality (19) it follows that $|Q_k|$ increases monotonically as k increases, and $|Q_k| \ge |Q_0| = 1$.

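The convergence of the continued fraction (16) is equivalent to the convergence of the series

$$\frac{P_0}{Q_0} + \sum_{k=1}^{\infty} \left(\frac{P_k}{Q_k} - \frac{P_{k-1}}{Q_{k-1}}\right) = \sum_{k=1}^{\infty} \frac{(-1)^{k-1}\, b_1 b_2 \ldots b_k}{Q_{k-1} Q_k} \tag{20}$$

Let us consider the series of the moduli

$$\sum_{k=1}^{\infty} \frac{|b_1|\,|b_2| \ldots |b_k|}{|Q_{k-1}|\,|Q_k|} \tag{21}$$

On the basis of inequality (19) we have

$$\sum_{k=1}^{n} \frac{|b_1|\,|b_2| \ldots |b_k|}{|Q_{k-1}|\,|Q_k|} \le \sum_{k=1}^{n} \frac{|Q_k| - |Q_{k-1}|}{|Q_{k-1}|\,|Q_k|} = \sum_{k=1}^{n} \left(\frac{1}{|Q_{k-1}|} - \frac{1}{|Q_k|}\right) = \frac{1}{|Q_0|} - \frac{1}{|Q_n|} = 1 - \frac{1}{|Q_n|} < 1 \qquad (n = 1, 2, \ldots)$$

Thus, the partial sums of the series (21) are bounded and hence the series converges, and also

$$\sum_{k=1}^{\infty} \frac{|b_1|\,|b_2| \ldots |b_k|}{|Q_{k-1}|\,|Q_k|} \le 1 \tag{22}$$

But then the series (20) also converges (and converges absolutely) by virtue of the comparison test, that is to say there exists

$$\alpha = \lim_{n \to \infty} \frac{P_n}{Q_n} = \sum_{k=1}^{\infty} \left(\frac{P_k}{Q_k} - \frac{P_{k-1}}{Q_{k-1}}\right)$$

Besides, taking into account inequality (22), we have

$$|\alpha| \le 1$$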
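Note 1. For the continued fraction (16) to converge, it is sufficient that inequality (17) hold for $k \ge m$ and that $Q_k \ne 0$ for k < m.

Note 2. Under the conditions of Theorem 3, we have the following estimate for the value of the continued fraction α:

$$\left|\alpha - \frac{P_n}{Q_n}\right| \le \sum_{k=n+1}^{\infty} \left|\frac{P_k}{Q_k} - \frac{P_{k-1}}{Q_{k-1}}\right| = \sum_{k=n+1}^{\infty} \frac{|b_1|\,|b_2| \ldots |b_k|}{|Q_{k-1}|\,|Q_k|} \le \sum_{k=n+1}^{\infty} \left(\frac{1}{|Q_{k-1}|} - \frac{1}{|Q_k|}\right) = \frac{1}{|Q_n|} - \lim_{k \to \infty} \frac{1}{|Q_k|}$$

In particular, if $|Q_k| \to +\infty$ as $k \to \infty$, then

$$\left|\alpha - \frac{P_n}{Q_n}\right| \le \frac{1}{|Q_n|}$$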








2.5 EXPANDING FUNCTIONS INTO CONTINUED FRACTIONS


Continued fractions are a convenient way of representing and 
computing functions. For details see the special literature (for 
example [2]); we confine ourselves here to a selection of examples. 

Note that if a function f(x) can, by some technique, be expan- 
ded into a nonterminating continued fraction, then in the general 
case we must prove the convergence of the fraction and assure 
ourselves that the limiting value of the continued fraction is equal 
to the function f (x). 




A. EXPANDING A RATIONAL FUNCTION INTO A CONTINUED FRACTION 


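If

$$f(x) = \frac{c_{10} + c_{11} x + c_{12} x^2 + \ldots}{c_{00} + c_{01} x + c_{02} x^2 + \ldots}$$

then in the general case we have, after performing elementary manipulations,

$$\frac{1}{f(x)} = \frac{c_{00}}{c_{10}} + \frac{x f_1(x)}{c_{10}}, \quad \text{that is,} \quad f(x) = \frac{c_{10}}{c_{00} + x f_1(x)}$$

where

$$f_1(x) = \frac{c_{20} + c_{21} x + c_{22} x^2 + \ldots}{c_{10} + c_{11} x + c_{12} x^2 + \ldots}$$

and

$$c_{2k} = c_{10}\, c_{0,\,k+1} - c_{00}\, c_{1,\,k+1} \qquad (k = 0, 1, \ldots)$$

Similarly

$$f_1(x) = \frac{c_{20}}{c_{10} + x f_2(x)}$$

where

$$f_2(x) = \frac{c_{30} + c_{31} x + c_{32} x^2 + \ldots}{c_{20} + c_{21} x + c_{22} x^2 + \ldots}$$

and

$$c_{3k} = c_{20}\, c_{1,\,k+1} - c_{10}\, c_{2,\,k+1} \qquad (k = 0, 1, \ldots)$$

and so forth.

Thus

$$f(x) = \cfrac{c_{10}}{c_{00} + \cfrac{c_{20}\, x}{c_{10} + \cfrac{c_{30}\, x}{c_{20} + \cdots}}} = \left[0;\ \frac{c_{10}}{c_{00}},\ \frac{c_{20}\, x}{c_{10}},\ \frac{c_{30}\, x}{c_{20}},\ \ldots,\ \frac{c_{n0}\, x}{c_{n-1,\,0}}\right] \tag{1}$$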


and it is easy to see that the continued fraction (1) is a terminating fraction.

The coefficients $c_{jk}$ of the expansions are conveniently computed successively by the formula

$$c_{jk} = -\begin{vmatrix} c_{j-2,\,0} & c_{j-2,\,k+1} \\ c_{j-1,\,0} & c_{j-1,\,k+1} \end{vmatrix}$$

where $j \ge 2$.

Note that in some cases the coefficients $c_{j0}$ may prove equal to zero. Then appropriate changes must be introduced in the expansion (1) [2].


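Example 1. Expand the function

$$f(x) = \frac{1 - x}{1 - 5x + 6x^2}$$

into a continued fraction.

Solution. Write down the coefficients $c_{jk}$ in the following scheme:

    j \ k |   0 |  1 | 2
    0     |   1 | -5 | 6
    1     |   1 | -1 | 0
    2     |  -4 |  6 |
    3     |  -2 |  0 |
    4     | -12 |    |

Thus

$$\frac{1 - x}{1 - 5x + 6x^2} = \left[0;\ \frac{1}{1},\ \frac{-4x}{1},\ \frac{-2x}{-4},\ \frac{-12x}{-2}\right] = \cfrac{1}{1 - \cfrac{4x}{1 - \cfrac{2x}{-4 + 6x}}}$$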
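The coefficient scheme lends itself to a small program. The sketch below builds the rows $c_{jk}$ with the determinant formula and then evaluates the resulting continued fraction from the bottom up. The function names are assumptions of this example, and the degenerate case of a vanishing leading coefficient $c_{j0}$ is only detected, not handled.

```python
from fractions import Fraction


def cf_leading_coefficients(den, num):
    """Return [c_00, c_10, c_20, ...] for f(x) = num(x)/den(x).

    den and num hold coefficients by ascending powers of x (rows c_0k and c_1k).
    Each new row is c_{j,k} = c_{j-1,0} c_{j-2,k+1} - c_{j-2,0} c_{j-1,k+1}.
    """
    width = max(len(den), len(num)) + 1

    def padded(row):
        return [Fraction(c) for c in row] + [Fraction(0)] * (width - len(row))

    prev2, prev1 = padded(den), padded(num)
    lead = [prev2[0], prev1[0]]
    while True:
        row = [prev1[0] * prev2[k + 1] - prev2[0] * prev1[k + 1]
               for k in range(width - 1)] + [Fraction(0)]
        if not any(row):
            return lead                  # the expansion terminates here
        if row[0] == 0:
            raise ValueError("c_j0 = 0: the scheme must be modified")
        lead.append(row[0])
        prev2, prev1 = prev1, row


def evaluate_cf(lead, x):
    """Evaluate [0; c_10/c_00, c_20 x/c_10, ..., c_n0 x/c_{n-1,0}] at x."""
    x = Fraction(x)
    n = len(lead) - 1
    d = lead[n - 1]                      # innermost partial denominator
    for j in range(n, 1, -1):
        d = lead[j - 2] + lead[j] * x / d
    return lead[1] / d


if __name__ == "__main__":
    lead = cf_leading_coefficients([1, -5, 6], [1, -1])
    print(lead, evaluate_cf(lead, Fraction(1, 4)))
```

Evaluation can still fail with a division by zero at isolated values of x where an intermediate denominator vanishes, even though f(x) itself is defined there.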

B. EXPANDING e^x INTO A CONTINUED FRACTION

For $e^x$ Euler obtained the expansion

$$e^x = \left[0;\ \frac{1}{1},\ \frac{-2x}{2 + x},\ \frac{x^2}{6},\ \frac{x^2}{10},\ \ldots,\ \frac{x^2}{4n + 2},\ \ldots\right] \tag{2}$$

which converges for any value of x, real or complex [2]. From this we obtain the convergents

$$\frac{P_1}{Q_1} = \frac{1}{1},$$

$$\frac{P_2}{Q_2} = \frac{2 + x}{2 - x},$$

$$\frac{P_3}{Q_3} = \frac{12 + 6x + x^2}{12 - 6x + x^2},$$

$$\frac{P_4}{Q_4} = \frac{120 + 60x + 12x^2 + x^3}{120 - 60x + 12x^2 - x^3}$$

and so forth.

In particular, setting x = 1 and confining ourselves to the fourth convergent, we have

$$e \approx \frac{193}{71} = 2.7183$$

To obtain the same accuracy in the Maclaurin expansion

$$e = 2 + \frac{1}{2!} + \frac{1}{3!} + \ldots$$

we need at least eight terms.
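As a check on the convergents of expansion (2), the following sketch applies the law of formation of Sec. 2.3 with the elements read off from (2): $b_1/a_1 = 1/1$, $b_2/a_2 = -2x/(2+x)$, and $b_k/a_k = x^2/(4k-6)$ for $k \ge 3$. The function name is an assumption of this example.

```python
from fractions import Fraction
from math import e


def exp_convergents(x, count):
    """Pairs (P_k, Q_k), k = 1..count, for [0; 1/1, -2x/(2+x), x^2/6, x^2/10, ...]."""
    x = Fraction(x)
    p_prev, q_prev = Fraction(1), Fraction(0)   # P_{-1}, Q_{-1}
    p, q = Fraction(0), Fraction(1)             # P_0 = a_0 = 0, Q_0 = 1
    out = []
    for k in range(1, count + 1):
        if k == 1:
            a_k, b_k = Fraction(1), Fraction(1)
        elif k == 2:
            a_k, b_k = 2 + x, -2 * x
        else:
            a_k, b_k = Fraction(4 * k - 6), x * x   # denominators 6, 10, 14, ...
        p, p_prev = a_k * p + b_k * p_prev, p
        q, q_prev = a_k * q + b_k * q_prev, q
        out.append((p, q))
    return out


if __name__ == "__main__":
    p4, q4 = exp_convergents(1, 4)[-1]
    print(p4, q4, float(p4 / q4) - e)
```

The pairs are returned unreduced and undivided because a denominator can vanish for particular x (for instance $Q_2 = 2 - x$ at x = 2).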


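C. EXPANDING tan x INTO A CONTINUED FRACTION

For tan x, Lambert obtained the expansion

$$\tan x = \left[0;\ \frac{x}{1},\ \frac{-x^2}{3},\ \frac{-x^2}{5},\ \ldots,\ \frac{-x^2}{2n + 1},\ \ldots\right] \tag{3}$$

which converges at all points of continuity of the function.

Example 2. Find tan 1 approximately.

Solution. Setting x = 1 in (3), we have

$$\tan 1 = \left[0;\ \frac{1}{1},\ \frac{-1}{3},\ \frac{-1}{5},\ \frac{-1}{7},\ \ldots\right]$$

On the basis of formula (3) of Sec. 2.3 let us set up the following scheme for the terms of convergents:

    k    | -1 | 0 | 1 |  2 |  3 |  4
    b_k  |    |   | 1 | -1 | -1 | -1
    a_k  |    | 0 | 1 |  3 |  5 |  7
    P_k  |  1 | 0 | 1 |  3 | 14 | 95
    Q_k  |  0 | 1 | 1 |  2 |  9 | 61

Confining ourselves to the fourth convergent, we have

$$\tan 1 \approx \frac{95}{61} = 1.557377$$

(tables give tan 1 = 1.557408).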
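The same forward recurrence reproduces the scheme for Lambert's expansion (3). In this sketch the elements are taken as $b_1 = x$, $b_k = -x^2$ for $k \ge 2$, and $a_k = 2k - 1$; the function name is an assumption of this example.

```python
from fractions import Fraction
from math import tan


def tan_convergent(x, k):
    """Return (P_k, Q_k) for [0; x/1, -x^2/3, -x^2/5, ...] by relations (3), (3')."""
    x = Fraction(x)
    p_prev, q_prev = Fraction(1), Fraction(0)   # P_{-1}, Q_{-1}
    p, q = Fraction(0), Fraction(1)             # P_0, Q_0
    for j in range(1, k + 1):
        a_j = 2 * j - 1
        b_j = x if j == 1 else -x * x
        p, p_prev = a_j * p + b_j * p_prev, p
        q, q_prev = a_j * q + b_j * q_prev, q
    return p, q


if __name__ == "__main__":
    p, q = tan_convergent(1, 4)
    print(p, q, float(p / q), tan(1.0))
```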




REFERENCES FOR CHAPTER 2 


[1] A. Ya. Khinchin, Continued Fractions, 1949, Chapter I (in Russian).

[2] A. N. Khovansky, Application of Continued Fractions and Their Generalizations to Problems of Approximate Analysis, 1956, Chapters I and II (in Russian).

[3] G. M. Fikhtengolts, Principles of Mathematical Analysis, 1955, Vol. 1, Chapter III (in Russian).

[4] O. Perron, Die Lehre von den Kettenbrüchen, 1929, Chapter VII.


Chapter 3 
COMPUTING THE VALUES OF FUNCTIONS 


When using computing machines to evaluate functions given by 
formulas, the form in which the formula is written is not at all 
immaterial. Expressions which are mathematically equivalent are 
not of the same status as concerns approximate computations. This 
gives rise to a problem of practical importance, that of finding 
the most convenient analytic expressions for elementary functions. 
Computing the values of functions ordinarily reduces to a sequence 
of elementary arithmetic operations. Taking into account the restric- 
ted storage volume of any machine, it is desirable to split up the 
operations into repeating cycles. We now consider some typical 
techniques of computation. 


3.1 COMPUTING THE VALUES OF A POLYNOMIAL. 
HORNER’S SCHEME 
Suppose we have a polynomial of degree n,

$$P(x) = a_0 x^n + a_1 x^{n-1} + \ldots + a_{n-1} x + a_n \tag{1}$$

with real coefficients $a_k$ (k = 0, 1, ..., n). Suppose it is required to find the value of this polynomial for x = ξ:

$$P(\xi) = a_0 \xi^n + a_1 \xi^{n-1} + \ldots + a_{n-1} \xi + a_n \tag{2}$$

The computation of P(ξ) is most conveniently performed as follows. Represent formula (2) as

$$P(\xi) = (\ldots(((a_0 \xi + a_1)\,\xi + a_2)\,\xi + a_3)\,\xi + \ldots + a_{n-1})\,\xi + a_n$$

We then successively compute the numbers

$$\begin{aligned}
b_0 &= a_0, \\
b_1 &= a_1 + b_0 \xi, \\
b_2 &= a_2 + b_1 \xi, \\
b_3 &= a_3 + b_2 \xi, \\
&\;\;\vdots \\
b_n &= a_n + b_{n-1} \xi
\end{aligned} \tag{3}$$

and find $b_n = P(\xi)$.
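Formulas (3) amount to a single loop. A minimal Python sketch follows; the function name `horner` and the convention of passing coefficients by descending powers are assumptions of this example.

```python
def horner(coeffs, xi):
    """Return [b_0, ..., b_n] of formulas (3); the last entry b_n equals P(xi).

    coeffs = [a_0, a_1, ..., a_n], ordered by descending powers of x.
    """
    b = [coeffs[0]]
    for a_k in coeffs[1:]:
        b.append(a_k + b[-1] * xi)   # b_k = a_k + b_{k-1} * xi
    return b


if __name__ == "__main__":
    # P(x) = 3x^3 + 2x^2 - 5x + 7 at x = 3
    print(horner([3, 2, -5, 7], 3))
```

The scheme uses n multiplications and n additions, whereas evaluating each power of ξ separately would need more multiplications.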


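We will show that the numbers $b_0 = a_0, b_1, \ldots, b_{n-1}$ are coefficients of the polynomial Q(x) obtained as the quotient on the division of the given polynomial P(x) by the binomial x - ξ. Indeed, let

$$Q(x) = \beta_0 x^{n-1} + \beta_1 x^{n-2} + \ldots + \beta_{n-1} \tag{4}$$

and

$$P(x) = Q(x)\,(x - \xi) + \beta_n \tag{5}$$

On the basis of the remainder theorem the remainder is $\beta_n = P(\xi)$. From formulas (4) and (5) we get

$$P(x) = (\beta_0 x^{n-1} + \beta_1 x^{n-2} + \ldots + \beta_{n-1})\,(x - \xi) + \beta_n$$

or, removing brackets and collecting terms,

$$P(x) = \beta_0 x^n + (\beta_1 - \beta_0 \xi)\,x^{n-1} + (\beta_2 - \beta_1 \xi)\,x^{n-2} + \ldots + (\beta_{n-1} - \beta_{n-2} \xi)\,x + (\beta_n - \beta_{n-1} \xi)$$

Comparing coefficients of identical powers of the variable x in the right and left members of this equation, we find

$$\begin{aligned}
\beta_0 &= a_0, \\
\beta_1 - \beta_0 \xi &= a_1, \\
\beta_2 - \beta_1 \xi &= a_2, \\
&\;\;\vdots \\
\beta_{n-1} - \beta_{n-2} \xi &= a_{n-1}, \\
\beta_n - \beta_{n-1} \xi &= a_n
\end{aligned}$$

whence

$$\begin{aligned}
\beta_0 &= a_0 = b_0, \\
\beta_1 &= a_1 + \beta_0 \xi = b_1, \\
\beta_2 &= a_2 + \beta_1 \xi = b_2, \\
&\;\;\vdots \\
\beta_{n-1} &= a_{n-1} + \beta_{n-2} \xi = b_{n-1}, \\
\beta_n &= a_n + \beta_{n-1} \xi = b_n
\end{aligned}$$

which completes the proof.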

Thus, formulas (3) enable us, without performing the operation of division, to determine the coefficients of the quotient Q(x), and also the remainder P(ξ). Practical computations are carried out in accord with a scheme called Horner's scheme:

    a_0   a_1      a_2      ...   a_n            | ξ
          b_0 ξ    b_1 ξ    ...   b_{n-1} ξ
    ---------------------------------------------
    b_0   b_1      b_2      ...   b_n = P(ξ)

Example 1. Compute the value of the polynomial

$$P(x) = 3x^3 + 2x^2 - 5x + 7 \quad \text{for } x = 3$$

Solution. Set up the Horner scheme:

    3    2   -5    7   | 3
         9   33   84
    -------------------
    3   11   28   91 = P(3)
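The quotient property proved above can be checked numerically. The sketch below splits the Horner row into quotient and remainder and then multiplies back; both function names are assumptions of this example.

```python
def divide_by_linear(coeffs, xi):
    """Divide P(x) by (x - xi); return (quotient coefficients, remainder P(xi)).

    coeffs = [a_0, ..., a_n] by descending powers of x.
    """
    b = [coeffs[0]]
    for a_k in coeffs[1:]:
        b.append(a_k + b[-1] * xi)
    return b[:-1], b[-1]


def multiply_back(quotient, xi, remainder):
    """Expand Q(x) (x - xi) + remainder into descending-power coefficients."""
    out = [0] * (len(quotient) + 1)
    for i, q in enumerate(quotient):
        out[i] += q              # contribution of q * x
        out[i + 1] -= q * xi     # contribution of q * (-xi)
    out[-1] += remainder
    return out


if __name__ == "__main__":
    quotient, remainder = divide_by_linear([3, 2, -5, 7], 3)
    print(quotient, remainder, multiply_back(quotient, 3, remainder))
```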


Note. Horner's scheme permits us to find the bounds of the real roots of the given polynomial P(x).

Suppose that for x = β (β > 0) all the coefficients $b_i$ in the Horner scheme are nonnegative, and the first coefficient is positive, that is,

$$b_0 = a_0 > 0 \quad \text{and} \quad b_i \ge 0 \qquad (i = 1, 2, \ldots, n) \tag{6}$$

We can then assert that all the real roots $x_k$ (k = 1, 2, ..., m; $m \le n$) of the polynomial P(x) are not situated to the right of β, that is, $x_k \le \beta$ (k = 1, 2, ..., m) (Fig. 2).

[Fig. 2]

Indeed, since

$$P(x) = (b_0 x^{n-1} + \ldots + b_{n-1})\,(x - \beta) + b_n$$

it follows that for any x > β we have, by virtue of condition (6), P(x) > 0, which is to say that any number exceeding β is definitely not a root of the polynomial P(x). We thus have an upper bound for the real roots $x_k$ of a polynomial.

To obtain a lower bound of the roots $x_k$, form the polynomial

$$(-1)^n P(-x) = a_0 x^n - a_1 x^{n-1} + \ldots + (-1)^n a_n$$

For this new polynomial, we find a number x = α (α > 0) such that all coefficients in Horner's scheme are nonnegative. Then, by the earlier reasoning, for the real roots of the polynomial $(-1)^n P(-x)$, which are obviously equal to $-x_k$ (k = 1, 2, ..., m), we have the inequality $-x_k \le \alpha$.

Hence, $x_k \ge -\alpha$ (k = 1, 2, ..., m). We have thus obtained the lower bound -α of the real roots of the polynomial P(x), whence it follows that all real roots of P(x) lie in the interval [-α, β].


Example 2. Find the bounds of the real roots of the polynomial

$$P(x) = x^4 - 2x^3 + 3x^2 + 4x - 1$$

Solution. Compute the value of P(x) for, say, x = 2. Using Horner's scheme, we have

    1   -2    3    4   -1   | 2
         2    0    6   20
    ------------------------
    1    0    3   10   19

Since all coefficients $b_i \ge 0$, the real roots $x_k$ of the polynomial P(x) (if they exist) satisfy the inequality $x_k < 2$. The upper bound of the real roots is found. Now let us find the lower bound. Form a new polynomial:

$$Q(x) = (-1)^4 P(-x) = x^4 + 2x^3 + 3x^2 - 4x - 1$$

Computing the value of Q(x) for, say, x = 1, we have

    1    2    3   -4   -1   | 1
         1    3    6    2
    ------------------------
    1    3    6    2    1

All coefficients $b_i > 0$, hence $-x_k < 1$, that is, $x_k > -1$. And so all real roots of the given polynomial lie within the interval [-1, 2].
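The bounding procedure can be sketched in code as follows. The helper names and the idea of scanning a list of trial values are assumptions of this example; the text leaves the choice of trial values to the reader. The leading coefficient is taken to be positive, as in condition (6).

```python
def horner_row(coeffs, xi):
    """Third row of Horner's scheme: [b_0, ..., b_n] for P at xi."""
    b = [coeffs[0]]
    for a_k in coeffs[1:]:
        b.append(a_k + b[-1] * xi)
    return b


def is_upper_bound(coeffs, beta):
    """True if condition (6) holds at beta > 0: b_0 > 0 and all other b_i >= 0."""
    b = horner_row(coeffs, beta)
    return beta > 0 and b[0] > 0 and all(v >= 0 for v in b[1:])


def reflected(coeffs):
    """Coefficients of (-1)^n P(-x): every second coefficient changes sign."""
    return [c if i % 2 == 0 else -c for i, c in enumerate(coeffs)]


def real_root_bounds(coeffs, trial_values):
    """Return (-alpha, beta) using the first trial values that satisfy (6)."""
    beta = next((t for t in trial_values if is_upper_bound(coeffs, t)), None)
    alpha = next((t for t in trial_values if is_upper_bound(reflected(coeffs), t)), None)
    if alpha is None or beta is None:
        raise ValueError("no trial value satisfies condition (6)")
    return -alpha, beta


if __name__ == "__main__":
    print(real_root_bounds([1, -2, 3, 4, -1], range(1, 11)))
```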


3.2 THE GENERALIZED HORNER SCHEME 
Suppose that in a given polynomial

$$P(x) = a_0 x^n + a_1 x^{n-1} + \ldots + a_n \tag{1}$$

it is required for some reason to make a substitution of the variable x by the formula

$$x = y + \xi \tag{2}$$

where ξ is a fixed number and y is a new variable. Substitute expression (2) into (1) and perform the indicated operations. Collecting terms we get a new polynomial in the variable y:

$$P(y + \xi) = A_0 y^n + A_1 y^{n-1} + \ldots + A_n \tag{3}$$

Since polynomial (3) may be regarded as a Taylor series expansion in powers of y of the function P(y + ξ), the coefficients $A_i$ (i = 0, 1, 2, ..., n) can be computed from the formula

$$A_i = \frac{P^{(n-i)}(\xi)}{(n - i)!} \qquad (i = 0, 1, \ldots, n)$$

A more convenient practical method for finding the coefficients $A_i$ (i = 0, 1, 2, ..., n) is by means of the Horner scheme. First put y = 0 in (3). We then have $A_n = P(\xi)$. Dividing the polynomial (1) by the binomial x - ξ, we get

$$P(x) = (x - \xi)\,P_1(x) + P(\xi) \tag{4}$$


where

$$P_1(x) = b_0 x^{n-1} + b_1 x^{n-2} + \ldots + b_{n-1}$$

Now if we put x - ξ in (3) in place of y, we obtain

$$P(x) = (x - \xi)\,[A_0 (x - \xi)^{n-1} + A_1 (x - \xi)^{n-2} + \ldots + A_{n-1}] + P(\xi) \tag{5}$$

Comparing formulas (4) and (5), we conclude that

$$P_1(x) = A_0 (x - \xi)^{n-1} + A_1 (x - \xi)^{n-2} + \ldots + A_{n-1} \tag{6}$$

whence

$$A_{n-1} = P_1(\xi) \tag{7}$$

Similarly, dividing the polynomial $P_1(x)$ by the binomial x - ξ, we can set

$$P_1(x) = (x - \xi)\,P_2(x) + P_1(\xi) \tag{8}$$

where $P_2(x) = c_0 x^{n-2} + c_1 x^{n-3} + \ldots + c_{n-2}$.

Besides, from (6) and (7) we have

$$P_1(x) = (x - \xi)\,[A_0 (x - \xi)^{n-2} + A_1 (x - \xi)^{n-3} + \ldots + A_{n-2}] + P_1(\xi) \tag{9}$$

Comparing (8) and (9) we conclude that

$$P_2(x) = A_0 (x - \xi)^{n-2} + A_1 (x - \xi)^{n-3} + \ldots + A_{n-2}$$

whence $A_{n-2} = P_2(\xi)$.

Continuing this process, we successively express all coefficients $A_i$ (i = 0, 1, ..., n) in terms of the values of the corresponding polynomials $P_0(x) = P(x)$ and $P_1(x), \ldots, P_n(x) = a_0$ for x = ξ:

$$\begin{aligned}
A_n &= P_0(\xi), \\
A_{n-1} &= P_1(\xi), \\
A_{n-2} &= P_2(\xi), \\
&\;\;\vdots \\
A_0 &= P_n(\xi)
\end{aligned}$$

where the polynomials $P_{k+1}(x)$ are constructed, proceeding from the polynomials $P_k(x)$, from the formula

$$P_k(x) = (x - \xi)\,P_{k+1}(x) + P_k(\xi) \qquad (k = 0, 1, \ldots, n - 1)$$


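We use the generalized Horner scheme for computing the values of $P(\xi), P_1(\xi), P_2(\xi), \ldots$:

    a_0   a_1      a_2      ...   a_{n-1}        a_n            | ξ
          b_0 ξ    b_1 ξ    ...   b_{n-2} ξ      b_{n-1} ξ
    ------------------------------------------------------------
    b_0   b_1      b_2      ...   b_{n-1}        b_n = P(ξ)
          c_0 ξ    c_1 ξ    ...   c_{n-2} ξ
    ------------------------------------------------------------
    c_0   c_1      c_2      ...   c_{n-1} = P_1(ξ)
    . . . . . . . . . . . . . . . . . . . . . .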


Example. In order to eliminate the term containing the third power of the unknown in the polynomial

$$P(x) = x^4 - 8x^3 + 5x^2 + 2x - 7$$

set x = y + 2. Find the transformed polynomial.

Solution. Use the generalized Horner scheme:

    1   -8     5     2    -7   | ξ = 2
         2   -12   -14   -24
    ---------------------------
    1   -6    -7   -12   -31
         2    -8   -30
    ---------------------
    1   -4   -15   -42
         2    -4
    ---------------
    1   -2   -19
         2
    ---------
    1    0
    -----
    1

Hence

$$P(y + 2) = y^4 - 19y^2 - 42y - 31$$
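The generalized scheme is a sequence of shortening Horner passes over the same array. The following sketch is one way it might be coded; the function name `taylor_shift` is an assumption of this example.

```python
def taylor_shift(coeffs, xi):
    """Return [A_0, ..., A_n], the coefficients of P(y + xi) by descending powers of y.

    coeffs = [a_0, ..., a_n] by descending powers of x.
    """
    work = list(coeffs)
    n = len(work) - 1
    for stage in range(n):
        # One Horner pass over entries 0 .. n - stage.
        # After it, work[n - stage] holds P_stage(xi) = A_{n - stage}.
        for i in range(1, n - stage + 1):
            work[i] += work[i - 1] * xi
    return work


if __name__ == "__main__":
    # P(x) = x^4 - 8x^3 + 5x^2 + 2x - 7 with x = y + 2
    print(taylor_shift([1, -8, 5, 2, -7], 2))
```

Each pass leaves one more finished coefficient at the right-hand end of the array, so no storage beyond a copy of the input is needed.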


3.3 COMPUTING THE VALUES OF RATIONAL FRACTIONS 


Every rational fraction R(x) may be represented in the form of a ratio of two polynomials,

$$R(x) = \frac{P(x)}{Q(x)} \tag{1}$$

where

$$P(x) = a_0 x^n + a_1 x^{n-1} + \ldots + a_n,$$

$$Q(x) = b_0 x^m + b_1 x^{m-1} + \ldots + b_m$$

Suppose it is required to find the value of R(x) at the point x = ξ:

$$R(\xi) = \frac{P(\xi)}{Q(\xi)} \tag{2}$$

The numerator P(ξ) and the denominator Q(ξ) of (2) can be found by means of Horner's scheme. This then yields a simple method for computing the number R(ξ).

Another approach is to transform the function R(x) into a continued fraction. To do this, use the technique described in Sec. 2.5.


3.4 APPROXIMATING THE SUMS OF NUMERICAL SERIES 


Definition. The numerical series

$$a_1 + a_2 + \ldots + a_n + \ldots \tag{1}$$

is called convergent if there exists a limit of the sequence of its partial sums:

$$S = \lim_{n \to \infty} S_n \tag{2}$$

where

$$S_n = a_1 + a_2 + \ldots + a_n$$

The number S is called the sum of the series.

Thus, the convergence of series (1) is equivalent to the convergence of the sequence of its partial sums. According to Cauchy's test [1] this sequence converges if and only if for an arbitrary ε > 0 there is an N = N(ε) such that

$$|S_{n+p} - S_n| < \varepsilon$$

for n > N and an arbitrary p > 0.
From formula (2) we get

$$S = S_n + R_n \tag{3}$$

where $R_n$ is the remainder of the series; $R_n \to 0$ as $n \to \infty$.

To find the sum S of the convergent series (1) to a specified accuracy ε, choose the number of terms n sufficiently large so that the inequality

$$|R_n| < \varepsilon \tag{4}$$

holds. Then the partial sum $S_n$ can, approximately, be taken for the exact sum S of the series (1).

It will be noted that actually the terms $a_1, a_2, \ldots$ are also determined approximately. Besides, the sum $S_n$ is ordinarily rounded off to a given number of decimal places. To take all these errors into account and ensure the required accuracy, one can use the following procedure: in the general case, choose three positive numbers $\varepsilon_1$, $\varepsilon_2$ and $\varepsilon_3$ such that

$$\varepsilon_1 + \varepsilon_2 + \varepsilon_3 = \varepsilon$$

Take the number n of terms of the series so large that the residual error $|R_n|$ satisfies the inequality

$$|R_n| \le \varepsilon_1 \tag{5}$$

Compute each of the terms $a_k$ (k = 1, 2, ..., n) with a limiting absolute error not exceeding $\varepsilon_2/n$, and let $\tilde{a}_k$ (k = 1, 2, ..., n) be the corresponding approximate values of the terms of series (1), that is

$$|a_k - \tilde{a}_k| \le \frac{\varepsilon_2}{n}$$

Then for the sum

$$\tilde{S}_n = \sum_{k=1}^{n} \tilde{a}_k$$

the error of operation (summation) satisfies the inequality

$$|S_n - \tilde{S}_n| \le \varepsilon_2 \tag{6}$$

Finally, round off the approximate result $\tilde{S}_n$ to the simpler number $\bar{S}_n$ so that the rounding error is

$$|\tilde{S}_n - \bar{S}_n| \le \varepsilon_3 \tag{7}$$

Then the number $\bar{S}_n$ is an approximate value of the sum S of series (1) to the specified accuracy ε. Indeed, from inequalities (5) to (7) we have

$$|S - \bar{S}_n| \le |S - S_n| + |S_n - \tilde{S}_n| + |\tilde{S}_n - \bar{S}_n| \le \varepsilon_1 + \varepsilon_2 + \varepsilon_3 = \varepsilon$$


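The number ε is partitioned into the positive parts $\varepsilon_1$, $\varepsilon_2$ and $\varepsilon_3$ in accordance with the volume of work required to obtain the desired result. If $\varepsilon = 10^{-m}$ and the result is to be correct to m decimal places, then one usually takes

$$\varepsilon_1 = \frac{\varepsilon}{4}, \quad \varepsilon_2 = \frac{\varepsilon}{4}, \quad \varepsilon_3 = \frac{\varepsilon}{2}$$

If there is no final rounding, then one ordinarily sets

$$\varepsilon_1 = \frac{\varepsilon}{2}, \quad \varepsilon_2 = \frac{\varepsilon}{2}, \quad \varepsilon_3 = 0$$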


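The problem becomes more complicated if one has to find the sum of a series correct, in the narrow meaning of the word, to m decimal places. Geometrically, this means that it is required to find the element of the set $\{k/10^m\}$ (k an integer) closest to S (Fig. 3).

[Fig. 3]

Suppose for definiteness the sum S is positive and

$$\bar{S} = p_0 + \frac{p_1}{10} + \ldots + \frac{p_m}{10^m} + \ldots + \frac{p_n}{10^n}$$

(where $p_i$ are nonnegative integers, n > m) is a rational approximation such that

$$|S - \bar{S}| \le \frac{1}{10^{m+1}}$$

Suppose that

$$p_{m+1} \ne 4 \quad \text{and} \quad p_{m+1} \ne 5$$

Then, rounding off the number $\bar{S}$, we get the desired result:

$$\sigma = p_0 + \frac{p_1}{10} + \ldots + \frac{p_m}{10^m} \quad \text{if } p_{m+1} \le 3 \tag{8}$$

or

$$\sigma = p_0 + \frac{p_1}{10} + \ldots + \frac{p_m + 1}{10^m} \quad \text{if } p_{m+1} \ge 6 \tag{8'}$$

Indeed, in the first case, when rounding down, we have

$$0 \le \bar{S} - \sigma = \frac{p_{m+1}}{10^{m+1}} + \frac{p_{m+2}}{10^{m+2}} + \ldots + \frac{p_n}{10^n} \le \frac{3}{10^{m+1}} + \frac{9}{10^{m+2}} + \ldots + \frac{9}{10^n} < \frac{4}{10^{m+1}}$$

In the second case, when rounding up, we get

$$0 < \sigma - \bar{S} = \frac{1}{10^m} - \frac{p_{m+1}}{10^{m+1}} - \ldots - \frac{p_n}{10^n} \le \frac{1}{10^m} - \frac{6}{10^{m+1}} = \frac{4}{10^{m+1}}$$

Therefore in both cases we have

$$|\bar{S} - \sigma| \le \frac{4}{10^{m+1}}$$

and so

$$|S - \sigma| \le |S - \bar{S}| + |\bar{S} - \sigma| \le \frac{1}{10^{m+1}} + \frac{4}{10^{m+1}} = \frac{1}{2} \cdot 10^{-m}$$

Thus

$$S = \sigma \pm \frac{1}{2} \cdot 10^{-m}$$

If $p_{m+1} = 4$ or $p_{m+1} = 5$, one should increase the accuracy of the approximate sum $\bar{S}$ by taking another decimal place.

In the particular case when $p_{m+1} = 4$ and it is known that

$$S < \bar{S}$$

then σ from (8) is an approximate value of the sum S to within $\frac{1}{2} \cdot 10^{-m}$ (too small).

Similarly, if $p_{m+1} = 5$ and

$$S > \bar{S}$$

then σ from (8') is an approximate value of the sum S to within $\frac{1}{2} \cdot 10^{-m}$ (too large).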

To estimate the remainder of the series (1),

$$R_n = a_{n+1} + a_{n+2} + \ldots$$

it is useful to apply the following theorems which we give without proof [1].

Theorem 1. If the terms of the series (1) are corresponding values of a positive monotonic decreasing function f(x), that is,

$$a_n = f(n) \qquad (n = 1, 2, \ldots) \tag{9}$$

then (Fig. 4)

$$\int_{n+1}^{\infty} f(x)\,dx < R_n < \int_{n}^{\infty} f(x)\,dx$$

[Fig. 4]

Theorem 2. If the series (1) is an alternating series:

$$a_1 > 0, \quad a_2 < 0, \quad a_3 > 0, \quad \ldots$$

and the moduli of its terms decrease monotonically, then

$$|R_n| \le |a_{n+1}|$$

and

$$\operatorname{sgn} R_n = \operatorname{sgn} a_{n+1} \;{}^{*)}$$

*) $\operatorname{sgn} R_n$ (the signum function) denotes the sign of the number $R_n$: $\operatorname{sgn} R_n = +1$ if $R_n > 0$; $\operatorname{sgn} R_n = -1$ if $R_n < 0$; $\operatorname{sgn} R_n = 0$ if $R_n = 0$.

Example. Find the sum of the series

$$S = \frac{1}{1^3} + \frac{1}{2^3} + \frac{1}{3^3} + \ldots + \frac{1}{n^3} + \ldots \tag{10}$$

to within 0.001.


Solution. We take the residual error as

$$\varepsilon_1 = \frac{1}{4} \cdot 10^{-3}$$

The terms of the series (10) are corresponding values of the monotonic decreasing function

$$f(x) = \frac{1}{x^3}$$

And so for the nth remainder of the series

$$R_n = \sum_{k=n+1}^{\infty} \frac{1}{k^3}$$

we have the estimate

$$R_n < \int_{n}^{\infty} \frac{dx}{x^3} = \frac{1}{2n^2}$$

Solving the inequality

$$\frac{1}{2n^2} \le \frac{1}{4} \cdot 10^{-3}$$

we get

$$n \ge \sqrt{2000} \approx 44.7$$

We take n = 45.

We choose the limiting error of the summation as

$$\varepsilon_2 = \frac{1}{4} \cdot 10^{-3}$$

whence the permissible limiting absolute error of the terms of the partial sum $S_{45}$ of series (10) is

$$\frac{\varepsilon_2}{n} = \frac{\frac{1}{4} \cdot 10^{-3}}{45} = \frac{5}{9} \cdot 10^{-5}$$

Set

$$\frac{\varepsilon_2}{n} = \frac{1}{2} \cdot 10^{-5}$$

That is to say, we will compute the terms of series (10) correct, in the narrow sense, to five decimal places. Following are the corresponding values of the terms and the results of partial summation:




1.00000 0.00024 0.00003 
0.12500 0.00020 0.00003 
0.03704 0.00017 0.00003 
0.01562 0.00014 0.00003 
0.00800 0.00012 0.00002 
0.00463 0.00011 0.00002 
0.00292 0.00009 0.00002 
0.00195 0.00008 0.00002 
0.00137 0.00007 0.00002 
0.00100 0.00006 0.00002 
0.00075 0.00006 0.00001 
0.00058 0.00005 0.00001 
0.00046 0.00004 0.00001 
0.00036 0.00004 0.00001 
0.00030 0.00004 0.00001 


1.19998 0.00151 0.00029
Thus,
$$S_{45} = 1.19998 + 0.00151 + 0.00029 = 1.20178$$


Rounding this value to thousandths, we get the approximate value 
of the sum: 


$$S \approx 1.202$$

Since the rounding error

$$\varepsilon_3 = 0.00022 < \frac{1}{4}\cdot 10^{-3}$$

the total error of the result does not exceed

$$\frac{1}{4}\cdot 10^{-3} + \frac{1}{4}\cdot 10^{-3} + \frac{1}{4}\cdot 10^{-3} < 10^{-3}$$

Thus,

$$S = 1.202 \pm 0.001$$

A more accurate estimate is obtained if the signs due to rounding are taken into account. By way of comparison we give the value of the sum S to within $\frac{1}{2}\cdot 10^{-6}$ [2]:

$$S = 1.202057$$
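
The example can be checked numerically with a short sketch. The function name `partial_sum_with_bounds` is ours, not the book's, and ordinary floating-point arithmetic stands in for the five-place hand computation, so the partial sum differs slightly from the rounded table value.

```python
def partial_sum_with_bounds(n):
    """Partial sum S_n of the series sum 1/k**3 and the Theorem 1 bounds on R_n."""
    s_n = sum(1.0 / k**3 for k in range(1, n + 1))
    lower = 1.0 / (2 * (n + 1) ** 2)   # integral of x**-3 from n+1 to infinity
    upper = 1.0 / (2 * n ** 2)         # integral of x**-3 from n to infinity
    return s_n, lower, upper

s45, lower, upper = partial_sum_with_bounds(45)
print(s45, lower, upper)
```

The true remainder should fall strictly between the two printed bounds, and the upper bound should not exceed the residual error chosen in the solution.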


Note. Since calculating the total error is an extremely laborious procedure, the practical approach is this: to ensure a given accuracy of $\varepsilon = 10^{-m}$, all intermediate computations are carried out with one or two extra digits. In this process, it is assumed (not quite strictly) that the errors involved do not affect the mth-order decimals of the sought-for result.

It will be noted that in working this example we had to find 
the sum of a comparatively large number of summands. In practi- 
cal work, the attempt is made first to transform the series so that 
the desired result is obtainable with a smaller number of terms. 
A transformation of this kind is known as accelerating the conver- 
gence of the series and in many cases affords a great saving in 
computational time. This problem is considered in Chapter 6. 


3.5 COMPUTING THE VALUES OF AN ANALYTIC FUNCTION 
A real function $f(x)$ is called analytic at a point $\xi$ if in some neighbourhood $|x - \xi| < R$ of this point the function is expansible in a power series (Taylor's series):

$$f(x) = f(\xi) + f'(\xi)(x - \xi) + \frac{f''(\xi)}{2!}(x - \xi)^2 + \dots + \frac{f^{(n)}(\xi)}{n!}(x - \xi)^n + \dots \qquad (1)$$

For $\xi = 0$ we get Maclaurin's series

$$f(x) = f(0) + f'(0)\,x + \frac{f''(0)}{2!}x^2 + \dots + \frac{f^{(n)}(0)}{n!}x^n + \dots \qquad (2)$$

The difference

$$R_n(x) = f(x) - \sum_{k=0}^{n} \frac{f^{(k)}(\xi)}{k!}(x - \xi)^k$$

is called the remainder term and is the error resulting from the replacement of the function $f(x)$ by the Taylor polynomial

$$P_n(x) = \sum_{k=0}^{n} \frac{f^{(k)}(\xi)}{k!}(x - \xi)^k$$

As is known [1],

$$R_n(x) = \frac{f^{(n+1)}(\xi + \theta(x - \xi))}{(n+1)!}(x - \xi)^{n+1} \qquad (3)$$

where $0 < \theta < 1$. In particular, for Maclaurin's series (2) we have [1]

$$R_n(x) = \frac{f^{(n+1)}(\theta x)}{(n+1)!}x^{n+1} \qquad (4)$$

where $0 < \theta < 1$. There are also other forms of remainder terms.
In many cases, expanding a function in a Taylor series is a convenient way to compute the values of the function.




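If $f(\xi)$ is known and it is required to find the value $f(\xi + h)$, where h is a "small correction", then formula (1) is better written as

$$f(\xi + h) = f(\xi) + f'(\xi)\,h + \frac{f''(\xi)}{2!}h^2 + \dots + \frac{f^{(n)}(\xi)}{n!}h^n + R_n(h) \qquad (5)$$

where

$$R_n(h) = \frac{f^{(n+1)}(\xi + \theta h)}{(n+1)!}h^{n+1} \quad (0 < \theta < 1)$$

Example. Approximate $\sqrt{10}$.
Solution. We have

$$\sqrt{10} = \sqrt{9 + 1} = 3\left(1 + \frac{1}{9}\right)^{1/2} \qquad (6)$$

Setting

$$F(x) = (1 + x)^{1/2}$$

we successively obtain

$$F'(x) = \frac{1}{2}(1 + x)^{-1/2},$$
$$F''(x) = -\frac{1}{4}(1 + x)^{-3/2},$$
$$F'''(x) = \frac{3}{8}(1 + x)^{-5/2},$$
$$F^{IV}(x) = -\frac{15}{16}(1 + x)^{-7/2}$$

Whence, taking $\xi = 0$, $h = \frac{1}{9}$ and noting that

$$F(0) = 1,\quad F'(0) = \frac{1}{2},\quad F''(0) = -\frac{1}{4},\quad F'''(0) = \frac{3}{8}$$

we find, by virtue of formula (5),

$$\left(1 + \frac{1}{9}\right)^{1/2} = 1 + \frac{1}{2}\cdot\frac{1}{9} - \frac{1}{8}\cdot\frac{1}{9^2} + \frac{1}{16}\cdot\frac{1}{9^3} + R_3 = 1 + 0.05556 - 0.00154 + 0.00009 + R_3 = 1.05411 + R_3 \qquad (7)$$

where

$$R_3 = -\frac{1}{4!}\cdot\frac{15}{16}\left(1 + \frac{\theta}{9}\right)^{-7/2}\cdot\frac{1}{9^4} = -\frac{10}{1\,679\,616}\left(1 + \frac{\theta}{9}\right)^{-7/2} \quad (0 < \theta < 1)$$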




Obviously

$$|R_3| < \frac{10}{1\,679\,616} < 6\cdot 10^{-6}$$

From formulas (6) and (7) we get

$$\sqrt{10} = 3.16233 + E \qquad (8)$$

where

$$|E| < 3\cdot 5\cdot 10^{-6} + 3\cdot 6\cdot 10^{-6} = 3.3\cdot 10^{-5}$$

Rounding off the value obtained to four decimal places, we finally have

$$\sqrt{10} = 3.1623 \pm 6\cdot 10^{-5}$$

Compare this with the tabular value

$$\sqrt{10} = 3.1622777\ldots$$
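
The same four-term expansion can be reproduced in a few lines. This is a sketch only: the helper name `sqrt_one_plus` is an assumption, and the terms are generated from the binomial coefficients of $(1 + h)^{1/2}$ rather than from tabulated derivatives.

```python
def sqrt_one_plus(h, n_terms=4):
    """Sum the first n_terms of the Maclaurin series of (1 + h)**0.5."""
    term, total = 1.0, 1.0
    for k in range(n_terms - 1):
        # ratio of consecutive binomial-series terms: (1/2 - k) / (k + 1) * h
        term *= (0.5 - k) / (k + 1) * h
        total += term
    return total

approx_sqrt10 = 3 * sqrt_one_plus(1 / 9)
print(approx_sqrt10)
```

With unrounded terms the only error left is three times the remainder $R_3$, so the result should agree with $\sqrt{10}$ to within $3\cdot 6\cdot 10^{-6}$.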


3.6 COMPUTING THE VALUES OF EXPONENTIAL FUNCTIONS 
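For the exponential function $e^x$ we have the expansion [1]

$$e^x = 1 + x + \frac{x^2}{2!} + \dots + \frac{x^n}{n!} + \dots \qquad (1)$$

whose interval of convergence is $-\infty < x < +\infty$. The remainder term of the series (1) is of the form

$$R_n(x) = \frac{e^{\theta x}}{(n+1)!}x^{n+1} \quad (0 < \theta < 1) \qquad (2)$$

For large absolute values of x, the series (1) is computationally inconvenient. The ordinary procedure is therefore as follows: let

$$x = E(x) + q$$

where $E(x)$ is the largest integer in the number x and $0 \le q < 1$ is the fractional part of the number. We have

$$e^x = e^{E(x)}\cdot e^{q} \qquad (3)$$

The first factor of the product (3) may be found by multiplying:

$$e^{E(x)} = \underbrace{e\cdot e \cdots e}_{E(x)\ \text{times}} \quad \text{if } E(x) \ge 0$$

or

$$e^{E(x)} = \underbrace{\frac{1}{e}\cdot\frac{1}{e}\cdots\frac{1}{e}}_{-E(x)\ \text{times}} \quad \text{if } E(x) < 0$$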



where

$$e = 2.718281828459045\ldots$$

and

$$\frac{1}{e} = 0.367879441171442\ldots$$

Here, in order to ensure the specified accuracy, e or $\frac{1}{e}$ should be taken to a sufficiently large number of decimal places (at the present time the number e has been computed to over 250 decimals).

As for the second factor $e^q$ of the product (3), one can compute it by the expansion given above:

$$e^q = \sum_{n=0}^{\infty} \frac{q^n}{n!} \qquad (4)$$

which for $0 \le q < 1$ forms a rapidly converging series, since, on the basis of formula (2), we have for the remainder term $R_n(q)$ the estimate

$$0 \le R_n(q) < \frac{3}{(n+1)!}q^{n+1}$$


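Let us now derive a more exact formula for estimating the remainder $R_n(q)$ for $0 < q < 1$. We have

$$R_n(q) = \frac{q^{n+1}}{(n+1)!} + \frac{q^{n+2}}{(n+2)!} + \frac{q^{n+3}}{(n+3)!} + \dots = \frac{q^{n+1}}{(n+1)!}\left[1 + \frac{q}{n+2} + \frac{q^2}{(n+2)(n+3)} + \dots\right] < \frac{q^{n+1}}{(n+1)!}\left[1 + \frac{q}{n+2} + \left(\frac{q}{n+2}\right)^2 + \dots\right]$$

From this, by summing the geometric progression in square brackets, we obtain

$$R_n(q) < \frac{q^{n+1}}{(n+1)!}\cdot\frac{1}{1 - \frac{q}{n+2}}$$

or, for $0 < q < 1$, noting that

$$\frac{1}{1 - \frac{q}{n+2}} < \frac{n+2}{n+1} < \frac{n+1}{n}$$

we finally get

$$0 < R_n(q) < \frac{q^{n+1}}{n!\,n}$$

or

$$0 < R_n(q) < u_n\cdot\frac{q}{n} \qquad (6)$$

where $u_n = \frac{q^n}{n!}$ is the last retained term.

If the residual error $\varepsilon$ is given, the necessary number of terms n may be found by inspection when solving the inequality

$$\frac{q^{n+1}}{n!\,n} < \varepsilon$$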


It is convenient to approximate $e^x$ for small x by formula (1) using the scheme

$$e^x = u_0 + u_1 + u_2 + \dots + u_n + R_n(x) \qquad (7)$$

where

$$u_0 = 1,\quad u_k = \frac{x\,u_{k-1}}{k} \quad (k = 1, 2, \dots, n) \qquad (8)$$

On computers, the calculations are conveniently carried out according to the scheme

$$u_k = \frac{x}{k}u_{k-1} \quad (k = 1, 2, \dots, n), \qquad s_k = s_{k-1} + u_k \quad (k = 0, 1, 2, \dots, n)$$

where $u_0 = 1$, $s_{-1} = 0$, so that $s_0 = 1$. The number $s_n = \sum_{k=0}^{n} \frac{x^k}{k!}$ approximately yields the desired value of $e^x$.

If $\varepsilon$ is the given permissible residual error and $n \ge 2|x| > 0$, then the summation process should be terminated as soon as the inequality

$$|R_n(x)| \le \frac{|x|^{n+1}}{(n+1)!}\left[1 + \frac{|x|}{n+2} + \frac{|x|^2}{(n+2)(n+3)} + \dots\right] < \frac{2|x|^{n+1}}{(n+1)!} = \frac{2|x|}{n+1}\cdot\frac{|x|^n}{n!} < |u_n| \le \varepsilon$$

is fulfilled, that is, if

$$|u_n| \le \varepsilon \qquad (9)$$

In other words, the summation process is terminated if the last generated term $u_n$ does not exceed $\varepsilon$ in modulus, and

$$|R_n(x)| < |u_n|$$

To compute the total error, use the general scheme given in Sec. 3.4.


Example 1. Find $\sqrt{e}$ to within $10^{-5}$.

Solution. Assume the residual error to be

$$\varepsilon = \frac{1}{4}\cdot 10^{-5} = 2.5\cdot 10^{-6}$$

Since, as a rough guess, the number of terms in the sum (7) will then be of the order of ten, we compute the terms to within $5\cdot 10^{-8}$, that is to say, to seven decimal places.

Setting

$$u_0 = 1,\quad u_k = \frac{u_{k-1}}{2k} \quad (k = 1, 2, \dots)$$

we successively have

$$\begin{aligned}
u_0 &= 1 \\
u_1 &= \frac{u_0}{2} = 0.5000000 \\
u_2 &= \frac{u_1}{4} = 0.1250000 \\
u_3 &= \frac{u_2}{6} = 0.0208333 \\
u_4 &= \frac{u_3}{8} = 0.0026042 \\
u_5 &= \frac{u_4}{10} = 0.0002604 \\
u_6 &= \frac{u_5}{12} = 0.0000217 \\
u_7 &= \frac{u_6}{14} = 0.0000016 \\
\text{sum} &= 1.6487212
\end{aligned}$$

Rounding off the sum to five decimal places, we get

$$\sqrt{e} = 1.64872 \qquad (10)$$

with total error

$$\varepsilon < 1.6\cdot 10^{-6} + 5\cdot 5\cdot 10^{-8} + 1.2\cdot 10^{-6} = 3.05\cdot 10^{-6} < \frac{1}{2}\cdot 10^{-5}$$

That is, correct (in the narrow sense) to all the digits given in the result (10).
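
A minimal sketch of this scheme follows. The function name `exp_by_series` is ours, and `math.e ** n` stands in for the repeated multiplication by e or 1/e described above; the loop applies stopping rule (9) to the fractional part q.

```python
import math

def exp_by_series(x, eps=1e-12):
    """e**x as e**E(x) times the series for e**q, with 0 <= q < 1."""
    n = math.floor(x)             # E(x), the integer part
    q = x - n                     # fractional part
    u, s, k = 1.0, 1.0, 0         # u_0 = 1, s_0 = 1
    # rule (9) requires k >= 2|q|; since q < 1, k >= 2 is always enough
    while k < 2 or u > eps:
        k += 1
        u *= q / k                # u_k = (q / k) * u_{k-1}
        s += u                    # s_k = s_{k-1} + u_k
    return math.e ** n * s

print(exp_by_series(0.5))
```

For x = 0.5 the printed value should round to the result (10).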

We can also compute $e^x$ by expansion into a continued fraction [4]:

$$e^x = \left[0;\ \frac{1}{1},\ \frac{-2x}{2 + x},\ \frac{x^2}{6},\ \frac{x^2}{10},\ \frac{x^2}{14},\ \dots\right] \qquad (11)$$

which converges for any value of x (real or complex).
Example 2. Find $\sqrt{e}$ by means of formula (11).

Solution. Setting $x = \frac{1}{2}$ in formula (11), construct a table of convergents of the corresponding continued fraction.

    k      -1    0    1    2      3      4        5
    a_k               1   -1     1/4    1/4      1/4
    b_k          0    1    5/2    6      10       14
    P_k     1    0    1    5/2    61/4   1225/8   34361/16
    Q_k     0    1    1    3/2    37/4   743/8    20841/16

Stopping with the fifth convergent, we have

$$\sqrt{e} \approx \frac{P_5}{Q_5} = \frac{34361}{16} : \frac{20841}{16} = \frac{34361}{20841} = 1.648721$$

to within $5\cdot 10^{-7}$.
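
The table of convergents can be regenerated exactly with rational arithmetic. The sketch below assumes the usual recurrences $P_k = b_k P_{k-1} + a_k P_{k-2}$, $Q_k = b_k Q_{k-1} + a_k Q_{k-2}$ with $P_{-1} = 1$, $P_0 = 0$, $Q_{-1} = 0$, $Q_0 = 1$; the function name `exp_convergents` is hypothetical.

```python
from fractions import Fraction

def exp_convergents(x, count=5):
    """Convergents P_k/Q_k (k = 1..count, count >= 2) of fraction (11) for e**x."""
    x = Fraction(x)
    a = [Fraction(1), -2 * x] + [x * x] * (count - 2)          # partial numerators
    b = [Fraction(1), 2 + x] + [Fraction(4 * k + 2)            # 6, 10, 14, ...
                                for k in range(1, count - 1)]
    p_prev, p = Fraction(1), Fraction(0)    # P_{-1} = 1, P_0 = 0
    q_prev, q = Fraction(0), Fraction(1)    # Q_{-1} = 0, Q_0 = 1
    result = []
    for a_k, b_k in zip(a, b):
        p_prev, p = p, b_k * p + a_k * p_prev
        q_prev, q = q, b_k * q + a_k * q_prev
        result.append(p / q)
    return result

convergents = exp_convergents(Fraction(1, 2))
print(convergents[-1], float(convergents[-1]))
```

Using `Fraction` keeps every convergent exact, so the fifth one can be compared directly with the ratio obtained in the example.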
The following formula may be used to compute the values of the exponential function $a^x$ $(a > 0)$:

$$a^x = 1 + (\ln a)\,x + \frac{\ln^2 a}{2!}x^2 + \dots + \frac{\ln^n a}{n!}x^n + \dots$$


3.7 COMPUTING THE VALUES OF A LOGARITHMIC FUNCTION 


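For the natural logarithms of numbers close to unity, the expansion [1]

$$\ln(1 + x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \dots + (-1)^{n-1}\frac{x^n}{n} + \dots \quad (-1 < x \le 1) \qquad (1)$$

holds true. Formula (1) is hardly suitable for computations since the range of numbers $0 < 1 + x \le 2$ is small and, besides, for $|x|$ close to unity, the series (1) converges slowly.
Let us introduce a more convenient formula for computing the natural logarithms of numbers. Replacing x in (1) by $-x$, we have

$$\ln(1 - x) = -x - \frac{x^2}{2} - \frac{x^3}{3} - \frac{x^4}{4} - \dots - \frac{x^n}{n} - \dots \qquad (2)$$

Subtracting (1) from (2) term by term, we obtain

$$\ln\frac{1 - x}{1 + x} = -2\left(x + \frac{x^3}{3} + \frac{x^5}{5} + \dots\right)$$

Setting

$$\frac{1 - x}{1 + x} = z$$

we get

$$x = \frac{1 - z}{1 + z}$$

and, hence,

$$\ln z = -2\left[\frac{1 - z}{1 + z} + \frac{1}{3}\left(\frac{1 - z}{1 + z}\right)^3 + \frac{1}{5}\left(\frac{1 - z}{1 + z}\right)^5 + \dots\right] \qquad (3)$$

for $0 < z < +\infty$.
Let x be a positive number. We represent it as

$$x = 2^m z$$

where m is an integer and $\frac{1}{2} \le z < 1$. Then, setting

$$\frac{1 - z}{1 + z} = \xi$$

where

$$0 < \xi \le \frac{1 - \frac{1}{2}}{1 + \frac{1}{2}} = \frac{1}{3}$$

and using formula (3), we get

$$\ln x = \ln(2^m z) = m\ln 2 + \ln z = m\ln 2 - 2\left(\xi + \frac{\xi^3}{3} + \dots + \frac{\xi^{2n-1}}{2n-1}\right) - R_n$$

where

$$R_n = 2\left(\frac{\xi^{2n+1}}{2n+1} + \frac{\xi^{2n+3}}{2n+3} + \dots\right) < \frac{2\xi^{2n+1}}{2n+1}\left(1 + \xi^2 + \xi^4 + \dots\right) = \frac{2\xi^{2n+1}}{(2n+1)(1 - \xi^2)}$$

For $0 < \xi \le \frac{1}{3}$ we have

$$\frac{1}{1 - \xi^2} \le \frac{9}{8}$$

and so

$$0 < R_n < \frac{9}{4}\cdot\frac{\xi^{2n+1}}{2n+1} \qquad (4)$$

or, more crudely,

$$0 < R_n < \frac{1}{4(2n+1)}\left(\frac{1}{3}\right)^{2n-1}$$

Introducing the notation

$$u_k = \frac{\xi^{2k-1}}{2k-1} \quad (k = 1, 2, \dots)$$

we get

$$\ln x = m\ln 2 - 2(u_1 + u_2 + \dots + u_n) - R_n \qquad (5)$$

where

$$\ln 2 = 0.69314718\ldots$$

The summation process terminates as soon as

$$u_n < 4\varepsilon$$

where $\varepsilon$ is the permissible residual error, because then, by formula (4), we have

$$R_n < \frac{9}{4}\cdot\frac{\xi^{2n+1}}{2n+1} < \frac{9}{4}\,\xi^2 u_n \le \frac{1}{4}u_n < \varepsilon$$

The limiting error of the sum $\sum u_k$ may be estimated by specifying a certain number of decimal places in the summands and establishing, on the basis of (4), the approximate number of summands n.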


Example. Find $\ln 3$ to within $10^{-5}$.
Solution. We will perform the computations with two additional digits. Putting

$$3 = 2^2\cdot\frac{3}{4} = 2^2\cdot 0.75$$

we have $z = 0.75$ and

$$\xi = \frac{1 - z}{1 + z} = \frac{0.25}{1.75} = \frac{1}{7} = 0.1428571$$

We have

$$\begin{aligned}
u_1 &= \xi = 0.1428571 \\
u_2 &= \frac{\xi^3}{3} = 0.0009718 \\
u_3 &= \frac{\xi^5}{5} = 0.0000119 \\
u_4 &= \frac{\xi^7}{7} = 0.0000002 \\
\text{sum} &= 0.1438410
\end{aligned}$$

Using formula (5), we get

$$\ln 3 = 2\cdot 0.69314718 - 2\cdot 0.1438410 = 1.09861$$
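
The procedure can be sketched as follows. The name `ln_by_series` and the constant `LN2` are ours; `math.frexp` is used only because it returns exactly the decomposition $x = 2^m z$ with $\frac{1}{2} \le z < 1$.

```python
import math

LN2 = 0.6931471805599453   # ln 2, taken as a known constant

def ln_by_series(x, eps=1e-10):
    """Natural logarithm of x > 0 via formula (5) with the stopping rule u_n < 4*eps."""
    z, m = math.frexp(x)               # x = z * 2**m, 0.5 <= z < 1
    xi = (1 - z) / (1 + z)
    power, k, total = xi, 1, 0.0
    while True:
        u = power / (2 * k - 1)        # u_k = xi**(2k-1) / (2k-1)
        total += u
        if u < 4 * eps:                # then R_n < u_n / 4 < eps
            break
        power *= xi * xi
        k += 1
    return m * LN2 - 2 * total

print(ln_by_series(3.0))
```

For x = 3 the decomposition gives m = 2 and z = 0.75, the same values as in the worked example.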


Note. It is also possible to compute natural logarithms by proceeding from the representation


$$x = e^p z$$

where p is an integer and $\frac{1}{e} \le z < 1$ (see [5]).
To compute common logarithms, use the formula

$$\log_{10} x = M\ln x$$

where

$$M = \log_{10} e = 0.434294481903252\ldots$$


3.8 COMPUTING THE VALUES OF TRIGONOMETRIC FUNCTIONS 


A. COMPUTING SINE AND COSINE VALUES 


Using reduction formulas it is possible to confine the argument x to the interval $0 \le x \le \frac{\pi}{2}$. If $0 \le x \le \frac{\pi}{4}$, we have

$$\sin x = \sum_{n=0}^{\infty} (-1)^n\frac{x^{2n+1}}{(2n+1)!} \qquad (1)$$

but if $\frac{\pi}{4} < x \le \frac{\pi}{2}$, then we set

$$\sin x = \cos z = \sum_{n=0}^{\infty} (-1)^n\frac{z^{2n}}{(2n)!} \qquad (2)$$

where $z = \frac{\pi}{2} - x$ and $0 \le z < \frac{\pi}{4}$.

The sum of the series (1) may conveniently be computed by the summation process

$$\sin x = u_1 + u_2 + \dots + u_n + R_n \qquad (3)$$

where the summands $u_k$ $(k = 1, 2, \dots, n)$ are successively found by means of the recurrence relation

$$u_1 = x,\quad u_{k+1} = -\frac{x^2}{2k(2k+1)}u_k \quad (k = 1, 2, \dots, n-1)$$

Since the series (1) is an alternating series with terms monotonically decreasing in modulus, for the remainder term $R_n$ the estimate

$$|R_n| \le \frac{x^{2n+1}}{(2n+1)!} = |u_{n+1}|$$

holds true, and

$$\operatorname{sgn} R_n = \operatorname{sgn} u_{n+1}$$

And we can terminate the summation process as soon as we find that

$$|u_n| < \varepsilon$$

where $\varepsilon$ is the specified residual error.
Analogously,

$$\cos z = v_1 + v_2 + \dots + v_n + R_n$$

where

$$v_1 = 1,\quad v_{k+1} = -\frac{z^2}{(2k-1)\,2k}v_k \quad (k = 1, 2, \dots, n-1)$$

and

$$|R_n| \le \frac{z^{2n}}{(2n)!} = |v_{n+1}|,\quad \operatorname{sgn} R_n = \operatorname{sgn} v_{n+1}$$
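Example. Find $\sin 20°30'$ to within $10^{-5}$.
Solution. We have

$$x = \operatorname{arc} 20°30' = \frac{\pi}{180}\cdot 20 + \frac{\pi}{180\cdot 60}\cdot 30 = 0.349066 + 0.008727 = 0.357793$$

Using formula (3), we obtain

$$\begin{aligned}
u_1 &= x = +0.357793 \\
u_2 &= -\frac{x^2 u_1}{2\cdot 3} = -0.007634 \\
u_3 &= -\frac{x^2 u_2}{4\cdot 5} = +0.000049 \\
u_4 &= -\frac{x^2 u_3}{6\cdot 7} = -0.000000 \\
\text{sum} &= 0.350208
\end{aligned}$$

whence

$$\sin 20°30' = 0.35021$$

The values of $\cos x$ are computed in a similar manner.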
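
The two summation processes can be combined in one routine. This is an illustrative sketch (the name `sin_by_series` is ours) that assumes the argument has already been reduced to $0 \le x \le \frac{\pi}{2}$ and stops at the first term smaller than the residual error.

```python
import math

def sin_by_series(x, eps=1e-10):
    """sin x for 0 <= x <= pi/2 using series (1), or series (2) for cos(pi/2 - x)."""
    if x <= math.pi / 4:
        arg, term = x, x                            # u_1 = x
        denom = lambda k: 2 * k * (2 * k + 1)
    else:
        arg, term = math.pi / 2 - x, 1.0            # v_1 = 1, z = pi/2 - x
        denom = lambda k: (2 * k - 1) * 2 * k
    total, k = 0.0, 1
    while abs(term) >= eps:
        total += term
        term *= -arg * arg / denom(k)               # next summand by the recurrence
        k += 1
    return total        # |remainder| <= |first omitted term| < eps

print(sin_by_series(math.radians(20.5)))
```

Because the series alternates, the first omitted term bounds the remainder, which is why the loop may stop as soon as a term falls below `eps`.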
B. COMPUTING TANGENTS 


We can take it that $0 \le x \le \frac{\pi}{4}$. For $\tan x$, when $|x| < \frac{\pi}{2}$, the following expansion [6] holds true:

$$\tan x = x + \frac{x^3}{3} + \frac{2x^5}{15} + \frac{17x^7}{315} + \frac{62x^9}{2835} + \dots$$

The coefficients of the expansion are expressed in terms of Bernoulli numbers (see Sec. 16.12).

The value of a tangent is very simply computed by means of continued fractions. Assuming

$$\tan x = \frac{x}{y}$$

we have, by virtue of Lambert's formula (Sec. 2.6),

$$y = \left[1;\ \frac{-x^2}{3},\ \frac{-x^2}{5},\ \dots,\ \frac{-x^2}{2n+1},\ \dots\right]$$

In order to compute y accurate to $10^{-10}$ it is sufficient to take n = 7. Then

$$y = 1 - x^2 : (3 - x^2 : (5 - x^2 : (7 - x^2 : (9 - x^2 : (11 - x^2 : (13 - x^2 : 15)))))) \qquad (4)$$

Ordinarily, y is computed with the aid of the Horner scheme for division (beginning at the end):

$$\begin{aligned}
y_6 &= 13 - x^2 : 15, \\
y_5 &= 11 - x^2 : y_6, \\
y_4 &= 9 - x^2 : y_5, \\
y_3 &= 7 - x^2 : y_4, \\
y_2 &= 5 - x^2 : y_3, \\
y_1 &= 3 - x^2 : y_2, \\
y = y_0 &= 1 - x^2 : y_1
\end{aligned}$$

Whence $\tan x = \frac{x}{y}$.
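Example. Find $\tan 40°$.

Solution. We have

$$x = \operatorname{arc} 40° = 0.698132$$

and

$$x^2 = 0.487388$$

From this,

$$\begin{aligned}
y_6 &= 13 - \frac{0.487388}{15} = 12.967508, \\
y_5 &= 11 - \frac{0.487388}{12.967508} = 10.962413, \\
y_4 &= 9 - \frac{0.487388}{10.962413} = 8.955540, \\
y_3 &= 7 - \frac{0.487388}{8.955540} = 6.945577, \\
y_2 &= 5 - \frac{0.487388}{6.945577} = 4.929828, \\
y_1 &= 3 - \frac{0.487388}{4.929828} = 2.901135, \\
y = y_0 &= 1 - \frac{0.487388}{2.901135} = 0.832001
\end{aligned}$$

and so

$$\tan 40° = \frac{0.698132}{0.832001} = 0.839100$$

All digits are correct in this result.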
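
The backward (Horner-type) evaluation of the continued fraction takes only a few lines. The sketch below uses a hypothetical function name `tan_lambert`; `n = 7` reproduces the truncation used in formula (4).

```python
import math

def tan_lambert(x, n=7):
    """tan x = x / y, with y evaluated from the innermost denominator outwards."""
    x2 = x * x
    y = 2 * n + 1                          # innermost denominator (15 for n = 7)
    for d in range(2 * n - 1, 0, -2):      # 13, 11, 9, 7, 5, 3, 1
        y = d - x2 / y
    return x / y

print(tan_lambert(math.radians(40)))
```

Each pass of the loop corresponds to one line of the Horner scheme, ending with $y = y_0$.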




3.9 COMPUTING THE VALUES OF HYPERBOLIC FUNCTIONS 


A. COMPUTING THE VALUES OF THE HYPERBOLIC SINE 
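We know that

$$\sinh x = \frac{e^x - e^{-x}}{2}$$

and

$$\sinh(-x) = -\sinh x$$

The following expansion holds for the hyperbolic sine:

$$\sinh x = x + \frac{x^3}{3!} + \frac{x^5}{5!} + \dots \quad (-\infty < x < +\infty)$$

Assuming $x > 0$, the computations are conveniently performed by the summation process:

$$\sinh x = u_1 + u_2 + \dots + u_n + R_n$$

where

$$u_1 = x,\quad u_{k+1} = \frac{x^2}{2k(2k+1)}u_k \quad (k = 1, 2, \dots, n-1)$$

and $R_n$ is the remainder term. For $n \ge x > 0$ we have

$$R_n = \frac{x^{2n+1}}{(2n+1)!} + \frac{x^{2n+3}}{(2n+3)!} + \frac{x^{2n+5}}{(2n+5)!} + \dots = \frac{x^{2n+1}}{(2n+1)!}\left[1 + \frac{x^2}{(2n+2)(2n+3)} + \frac{x^4}{(2n+2)(2n+3)(2n+4)(2n+5)} + \dots\right] < \frac{x^{2n+1}}{(2n+1)!}\left[1 + \frac{1}{4} + \frac{1}{16} + \dots\right] = \frac{4}{3}\cdot\frac{x^{2n+1}}{(2n+1)!}$$

Since, obviously,

$$\frac{x^{2n+1}}{(2n+1)!} = \frac{x^2}{2n(2n+1)}u_n < \frac{1}{4}u_n$$

it follows that

$$R_n < \frac{1}{3}u_n$$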


B. COMPUTING THE VALUES OF THE HYPERBOLIC COSINE 


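As we know,

$$\cosh x = \frac{e^x + e^{-x}}{2}$$

and

$$\cosh(-x) = \cosh x$$

The following expansion holds true for the hyperbolic cosine:

$$\cosh x = 1 + \frac{x^2}{2!} + \frac{x^4}{4!} + \dots \quad (-\infty < x < +\infty)$$

The computations are conveniently carried out by the summation process

$$\cosh x = v_1 + v_2 + \dots + v_n + R_n$$

where

$$v_1 = 1,\quad v_{k+1} = \frac{x^2}{(2k-1)\,2k}v_k \quad (k = 1, 2, \dots, n-1)$$

and $R_n$ is the remainder term. For $n \ge |x| > 0$ we have

$$R_n = \frac{x^{2n}}{(2n)!} + \frac{x^{2n+2}}{(2n+2)!} + \frac{x^{2n+4}}{(2n+4)!} + \dots = \frac{x^{2n}}{(2n)!}\left[1 + \frac{x^2}{(2n+1)(2n+2)} + \frac{x^4}{(2n+1)(2n+2)(2n+3)(2n+4)} + \dots\right] < \frac{x^{2n}}{(2n)!}\left[1 + \frac{1}{4} + \frac{1}{16} + \dots\right] = \frac{4}{3}\cdot\frac{x^{2n}}{(2n)!}$$

Since for $n \ge 1$ the inequality

$$\frac{x^{2n}}{(2n)!} = \frac{x^2}{(2n-1)\,2n}v_n \le \frac{1}{2}v_n$$

holds true, it follows that

$$R_n < \frac{2}{3}v_n$$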
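
Both summation processes, together with their remainder estimates, can be sketched as below. The function names are ours; the loops continue until $n \ge |x|$ and the bound on $R_n$ (one third of $u_n$ for the sine, two thirds of $v_n$ for the cosine) is within the tolerance.

```python
import math

def sinh_by_series(x, eps=1e-10):
    sign, x = (-1.0, -x) if x < 0 else (1.0, x)     # sinh is odd
    u, total, n = x, x, 1                           # u_1 = x
    while not (n >= x and u / 3 <= eps):            # R_n < u_n / 3 once n >= x
        u *= x * x / (2 * n * (2 * n + 1))          # u_{n+1}
        n += 1
        total += u
    return sign * total

def cosh_by_series(x, eps=1e-10):
    x = abs(x)                                      # cosh is even
    v, total, n = 1.0, 1.0, 1                       # v_1 = 1
    while not (n >= x and 2 * v / 3 <= eps):        # R_n < 2 v_n / 3 once n >= |x|
        v *= x * x / ((2 * n - 1) * 2 * n)          # v_{n+1}
        n += 1
        total += v
    return total

print(sinh_by_series(2.5), cosh_by_series(2.5))
```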


C. COMPUTING THE HYPERBOLIC TANGENT 
We have

$$\tanh x = \frac{\sinh x}{\cosh x} = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$

where

$$\tanh(-x) = -\tanh x$$

For $|x|$ small, the following expansion can be used to compute the value of the hyperbolic tangent:

$$\tanh x = x - \frac{x^3}{3} + \frac{2x^5}{15} - \frac{17x^7}{315} + \frac{62x^9}{2835} - \dots \quad \left(|x| < \frac{\pi}{2}\right)$$

For any x, the value of the hyperbolic tangent is expressed by a continued fraction:

$$\tanh x = \left[0;\ \frac{x}{1},\ \frac{x^2}{3},\ \frac{x^2}{5},\ \dots,\ \frac{x^2}{2n+1},\ \dots\right]$$

And since for $x > 0$ the elements of this fraction are positive, $\tanh x$, for $x > 0$, lies between two successive convergents.
If $x > 0$ is great, then $\tanh x$ is conveniently computed by the formula

$$\tanh x = 1 - \frac{2}{e^{2x} + 1}$$
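
A brief sketch of the two recipes: a truncated continued fraction evaluated from the tail, and the exponential formula for large x. The function names and the truncation depth `n=10` are assumptions made for illustration.

```python
import math

def tanh_cf(x, n=10):
    """tanh x = x / (1 + x^2/(3 + x^2/(5 + ...))), truncated at the denominator 2n+1."""
    x2 = x * x
    tail = 2 * n + 1
    for d in range(2 * n - 1, 0, -2):      # 2n-1, ..., 5, 3, 1
        tail = d + x2 / tail
    return x / tail

def tanh_large(x):
    """Formula for large positive x."""
    return 1 - 2 / (math.exp(2 * x) + 1)

print(tanh_cf(0.8), tanh_large(5.0))
```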
3.10. USING THE METHOD OF ITERATION 
FOR APPROXIMATING THE VALUES OF A FUNCTION 


Let it be required to compute the value of the continuous function

$$y = f(x) \qquad (1)$$

for a given value of the argument x. If the function (1) is complicated and one has to compute a large number of its values, the computations are ordinarily performed on computers. It may happen that direct computation of the values of the function by formula (1) will be difficult due to the design features of the machine. The simplest operations may turn out to be "complicated" and even impossible to perform. For instance, there are calculating machines that do not perform division. In such cases the following technique is often useful. Write (1) in implicit form:

$$F(x, y) = 0 \qquad (2)$$

We assume that $F(x, y)$ is continuous and has a continuous partial derivative $F'_y(x, y) \ne 0$.

Let $y_n$ be an approximate value of y. Using Lagrange's theorem, we have

$$F(x, y_n) = F(x, y_n) - F(x, y) = (y_n - y)\,F'_y(x, \bar{y}_n)$$

where $\bar{y}_n$ is an intermediate value between $y_n$ and y, whence

$$y = y_n - \frac{F(x, y_n)}{F'_y(x, \bar{y}_n)} \qquad (3)$$

We do not know the value of $\bar{y}_n$. Assuming $\bar{y}_n \approx y_n$, we obtain the following iterative process [7] for computing the value of y:

$$y_{n+1} = y_n - \frac{F(x, y_n)}{F'_y(x, y_n)} \quad (n = 0, 1, 2, \dots) \qquad (4)$$

Formula (4) has a simple geometric meaning. Fix the value of x and consider the graph of the function

$$z = F(x, y) \qquad (5)$$

From formula (4) it follows that our process is the Newton method (see Sec. 4.5) applied to (5); that is, the successive approximations $y_{n+1}$ are obtained as the abscissas of the points of intersection with the y-axis of the tangent to the curve (5) drawn at $y = y_n$ $(n = 0, 1, 2, \dots)$ (Fig. 5). The convergence of the process is ensured if

$$F'_y(x, y) \quad \text{and} \quad F''_{yy}(x, y)$$

retain constant signs in the interval under consideration that contains the root y.

Generally speaking, the initial value $y_0$ is arbitrary and is chosen as close as possible to the desired value of y. The process of iteration is continued until, within the limits of the given accuracy $\varepsilon$, two successive values $y_{n-1}$ and $y_n$ coincide: $|y_{n-1} - y_n| < \varepsilon$. Strictly speaking, there is no guarantee here that

$$|y - y_n| < \varepsilon$$

For this reason, each concrete case requires an additional investigation.

The merit of iterative processes is that the operations are of the same type and therefore are comparatively easy to programme.

Fig. 5

It is worth noting that the representation F (x, y)=0 for the 
given function (1) may be realized in an infinitude of ways. This 
fact should be utilized in order to obtain a rapidly converging 
iteration process. Types of the basic processes are given in the 
sections which follow. 
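
The general process can be sketched as a small routine that takes $F$ and its partial derivative with respect to y as callables (x being fixed inside them). The names `iterate`, `F` and `dF_dy` are ours, and the stopping test is the practical one described above, which by itself guarantees nothing about the true error.

```python
def iterate(F, dF_dy, y0, eps=1e-12, max_iter=50):
    """Repeat y <- y - F(y) / F'_y(y) until two successive values agree to eps."""
    y = y0
    for _ in range(max_iter):
        y_next = y - F(y) / dF_dy(y)
        if abs(y_next - y) < eps:
            return y_next
        y = y_next
    raise RuntimeError("no convergence within max_iter steps")

# Illustration: y = sqrt(2) written implicitly as F(x, y) = y**2 - x = 0 with x = 2
root = iterate(lambda y: y * y - 2.0, lambda y: 2.0 * y, y0=1.0)
print(root)
```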


3.11 COMPUTING RECIPROCALS 

Let $y = \frac{1}{x}$. For definiteness we assume that $x > 0$. Set

$$F(x, y) = x - \frac{1}{y} = 0$$

Then

$$F'_y(x, y) = \frac{1}{y^2}$$

Using formula (4) of Sec. 3.10, we have

$$y_{n+1} = y_n - \frac{x - \frac{1}{y_n}}{\frac{1}{y_n^2}}$$

or

$$y_{n+1} = y_n(2 - x y_n) \quad (n = 0, 1, 2, \dots) \qquad (1)$$

We have thus obtained an iterative process without division. The initial value $y_0$ is chosen in the following manner. Suppose the argument is written in binary:

$$x = 2^m x_1$$

where m is an integer and $\frac{1}{2} \le x_1 < 1$. Then set

$$y_0 = 2^{-m} \qquad (2)$$

Let us examine the conditions of convergence of process (1). From formula (1) we have

$$\frac{1}{x} - y_n = \frac{1}{x} - 2y_{n-1} + x y_{n-1}^2 = x\left(\frac{1}{x} - y_{n-1}\right)^2 \qquad (3)$$

whence

$$\frac{1}{x} - y_n = x^{2^n - 1}\left(\frac{1}{x} - y_0\right)^{2^n} = \frac{1}{x}(1 - x y_0)^{2^n} \qquad (4)$$

For the convergence of process (1), it is necessary and sufficient that the inequality

$$|1 - x y_0| < 1$$

hold or that

$$-1 < 1 - x y_0 < 1$$

Thus, we finally get the following result: if

$$0 < x y_0 < 2 \qquad (5)$$

then

$$\lim_{n \to \infty} y_n = \frac{1}{x}$$

Note that for our choice of $y_0$ in (2), we have

$$x y_0 = 2^m x_1\cdot 2^{-m} = x_1$$

and so

$$\frac{1}{2} \le x y_0 < 1 \qquad (6)$$

and hence condition (5) is met. Besides, we conclude from formula (4) that

$$0 < \frac{1}{x} - y_n \le \frac{1}{x}\left(\frac{1}{2}\right)^{2^n} \le 2y_0\left(\frac{1}{2}\right)^{2^n}$$

That is to say, the convergence of the iteration process is extremely rapid.

Let us derive a different estimate of the error in the value of $y_n$ which is sometimes of more practical convenience. Note first of all that the successive approximations $y_0, y_1, y_2, \dots$ are obtained in the given case by Newton's method as applied to the hyperbola

$$z = x - \frac{1}{y} \quad (x = \text{const})$$

(Fig. 6). From the inequality (6) and formula (3) it follows that

$$0 < y_n < \frac{1}{x} \quad (n = 0, 1, 2, \dots)$$

Besides, since

$$y_n - y_{n-1} = y_{n-1}(1 - x y_{n-1}) = x y_{n-1}\left(\frac{1}{x} - y_{n-1}\right) > 0 \qquad (7)$$

it follows that the successive approximations of $y_n$ increase monotonically:

$$y_0 \le y_1 \le \dots \le y_n \le \dots$$

Fig. 6

From formula (7) we have

$$\frac{1}{x} - y_{n-1} = \frac{1}{x y_{n-1}}(y_n - y_{n-1})$$

or, since

$$x y_{n-1} \ge x y_0 \ge \frac{1}{2}$$

it follows that

$$\frac{1}{x} - y_{n-1} \le 2(y_n - y_{n-1})$$

Whence

$$\frac{1}{x} - y_n \le y_n - y_{n-1}$$

Thus, if it is found that $y_n - y_{n-1} < \varepsilon$, then the true error too is

$$0 < \frac{1}{x} - y_n < \varepsilon$$

Example. Using (1) find the value of the function $y = \frac{1}{x}$ for $x = 3$.
Solution. Here, $x = 2^2\cdot\frac{3}{4}$. Putting $y_0 = \frac{1}{4}$, we have

$$y_1 = \frac{1}{4}\left(2 - \frac{3}{4}\right) = \frac{5}{16} \approx 0.312,\quad y_2 = 0.312\,(2 - 3\cdot 0.312) = 0.332,\ \text{etc.}$$

The iteration process converges rapidly.
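
A minimal sketch of the division-free process follows. The name `reciprocal` is ours; `math.frexp` supplies the binary exponent m for the starting value $y_0 = 2^{-m}$, and the loop stops when the increase $y_n - y_{n-1}$ drops below the tolerance.

```python
import math

def reciprocal(x, eps=1e-12):
    """1/x for x > 0 by the iteration y <- y * (2 - x * y), no division used."""
    _, m = math.frexp(x)              # x = x1 * 2**m with 0.5 <= x1 < 1
    y = math.ldexp(1.0, -m)           # y0 = 2**(-m)
    while True:
        y_next = y * (2 - x * y)
        if y_next - y < eps:          # increments are non-negative in exact arithmetic
            return y_next
        y = y_next

print(reciprocal(3.0))
```

For x = 3 the iterates begin 0.25, 0.3125, 0.33203..., matching the start of the worked example.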




3.12 COMPUTING SQUARE ROOTS 





Let

$$y = \sqrt{x} \quad (x > 0) \qquad (1)$$

Set

$$F(x, y) = y^2 - x = 0$$

and then

$$F'_y(x, y) = 2y$$

Using formula (4) of Sec. 3.10, we have

$$y_{n+1} = y_n - \frac{y_n^2 - x}{2y_n}$$

or

$$y_{n+1} = \frac{1}{2}\left(y_n + \frac{x}{y_n}\right) \quad (n = 0, 1, 2, \dots) \qquad (2)$$

which is Hero's process (Hero's algorithm).

The successive approximations $y_1, y_2, y_3, \dots$ are clearly obtained by Newton's method applied to the parabola

$$z = y^2 - x \quad (x = \text{const})$$

(Fig. 7).
Note that if for $y_0$ we take the tabular value, which gives $\sqrt{x}$ with a relative error $|\delta|$, then $y_1$ determined from (2) will yield the value of $\sqrt{x}$ with, approximately, the relative error $\frac{\delta^2}{2}$.
Indeed, setting

$$y_0 = \sqrt{x}\,(1 + \delta)$$

and neglecting powers of $\delta$ above the second, we have

$$y_1 = \frac{1}{2}\left(y_0 + \frac{x}{y_0}\right) = \frac{1}{2}\left[\sqrt{x}\,(1 + \delta) + \sqrt{x}\,(1 + \delta)^{-1}\right] = \frac{1}{2}\sqrt{x}\,(1 + \delta + 1 - \delta + \delta^2) = \sqrt{x}\left(1 + \frac{\delta^2}{2}\right)$$

Fig. 7

From this we draw the following important conclusion: when Hero's process is used, the number of correct digits is roughly doubled at each step compared to the original number of correct digits.


Example 1. For $y = \sqrt{2}$ we have approximately

$$y_0 = 1.4$$

Making this value more precise, we obtain

$$y_1 = \frac{1}{2}\left(1.4 + \frac{2}{1.4}\right) = 0.7 + 0.714 = 1.414$$

Repeating the process, we get

$$y_2 = \frac{1}{2}\left(1.414 + \frac{2}{1.414}\right) = 0.707 + 0.7072136 = 1.4142136$$

correct to eight or seven decimal places. Indeed,

$$\sqrt{2} = 1.41421356\ldots$$


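Let us examine the conditions of convergence of the Hero process. From formula (2), replacing n + 1 by n, we have, for $y_0 > 0$,

$$y_n - \sqrt{x} = \frac{1}{2y_{n-1}}\left(y_{n-1} - \sqrt{x}\right)^2$$

and

$$y_n + \sqrt{x} = \frac{1}{2y_{n-1}}\left(y_{n-1} + \sqrt{x}\right)^2$$

whence

$$\frac{y_n - \sqrt{x}}{y_n + \sqrt{x}} = \left(\frac{y_{n-1} - \sqrt{x}}{y_{n-1} + \sqrt{x}}\right)^2 \qquad (3)$$

Consequently

$$\frac{y_n - \sqrt{x}}{y_n + \sqrt{x}} = \left(\frac{y_0 - \sqrt{x}}{y_0 + \sqrt{x}}\right)^{2^n} = q^{2^n}$$

and

$$y_n - \sqrt{x} = 2\sqrt{x}\cdot\frac{q^{2^n}}{1 - q^{2^n}} \qquad (4)$$

where

$$q = \frac{y_0 - \sqrt{x}}{y_0 + \sqrt{x}} \qquad (5)$$

From formula (4) it follows that the Hero process converges for

$$|q| < 1$$

that is, if

$$y_0 > 0$$

In this case we clearly have

$$\lim_{n \to \infty} y_n = \sqrt{x}$$

also

$$y_n \ge \sqrt{x} \quad (n = 1, 2, \dots)$$

Note that

$$y_{n-1} - y_n = y_{n-1} - \frac{1}{2}\left(y_{n-1} + \frac{x}{y_{n-1}}\right) = \frac{y_{n-1}^2 - x}{2y_{n-1}} \ge 0 \qquad (6)$$

and so the approximations $y_n$ for $n \ge 1$ form a monotonic decreasing sequence

$$y_1 \ge y_2 \ge \dots \ge y_{n-1} \ge y_n \ge \dots \ge \sqrt{x}$$

(Equality can only occur when $y_0 = \sqrt{x}$.)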


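When doing the computation on a computer, it is convenient to write the number x in binary: $x = 2^m x_1$, where m is an integer and $\frac{1}{2} \le x_1 < 1$. Then for the zero approximation one ordinarily takes

$$y_0 = 2^{E\left(\frac{m}{2}\right)} \qquad (7)$$

where $E\left(\frac{m}{2}\right)$ denotes the greatest integer in the number $\frac{m}{2}$.
Example 2. Find $\sqrt{5}$.

Solution. Here $x = 5 = 2^3\cdot\frac{5}{8}$ and so

$$y_0 = 2^{E\left(\frac{3}{2}\right)} = 2$$

Using formula (2) we successively find

$$y_1 = \frac{1}{2}\left(2 + \frac{5}{2}\right) = 2.25$$
$$y_2 = \frac{1}{2}\left(2.25 + \frac{5}{2.25}\right) = \frac{1}{2}(2.25 + 2.2222) = 2.2361$$

and so on. Using square-root tables we have

$$\sqrt{5} = 2.236068\ldots$$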


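Let us estimate the quantity $|q|$ defined by formula (5) by proceeding from the value of $y_0$ as determined by (7).
If m = 2p is even, then

$$y_0 = 2^{E\left(\frac{m}{2}\right)} = 2^p > \sqrt{x}$$

and, hence,

$$|q| = \frac{y_0 - \sqrt{x}}{y_0 + \sqrt{x}} = \frac{2^p - 2^p\sqrt{x_1}}{2^p + 2^p\sqrt{x_1}} = \frac{1 - \sqrt{x_1}}{1 + \sqrt{x_1}} \le \frac{1 - \sqrt{\frac{1}{2}}}{1 + \sqrt{\frac{1}{2}}} = \left(\sqrt{2} - 1\right)^2$$

Analogously, if m = 2p + 1 is odd, then

$$y_0 = 2^{E\left(p + \frac{1}{2}\right)} = 2^p \le \sqrt{x}$$

Therefore

$$|q| = \frac{\sqrt{x} - y_0}{\sqrt{x} + y_0} = \frac{2^p\sqrt{2x_1} - 2^p}{2^p\sqrt{2x_1} + 2^p} = 1 - \frac{2}{\sqrt{2x_1} + 1} < 1 - \frac{2}{\sqrt{2} + 1} = \left(\sqrt{2} - 1\right)^2$$

We thus always have

$$|q| \le \left(\sqrt{2} - 1\right)^2 = 0.1716\ldots < \frac{1}{5}$$

Whence, on the basis of (4), we get

$$0 \le y_n - \sqrt{x} < 2y_1\cdot\frac{25}{24}\left(\frac{1}{5}\right)^{2^n} \quad \text{for } n \ge 1$$

where

$$y_1 = \frac{1}{2}\left(y_0 + \frac{x}{y_0}\right) < \frac{3}{2}y_0$$

And from this,

$$0 \le y_n - \sqrt{x} < \frac{25}{8}y_0\left(\frac{1}{5}\right)^{2^n} \qquad (8)$$

From (8), it is easy to determine the number of iterations n = n(x) sufficient to ensure a given accuracy.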

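We give one more formula for estimating the error in the value of $y_n$ $(n \ge 2)$. Since

$$y_{n-1} \ge \sqrt{x} \quad \text{and} \quad \frac{x}{y_{n-1}} \le \sqrt{x}$$

we have, taking into account (6),

$$y_n - \sqrt{x} \le y_n - \frac{x}{y_{n-1}} = \frac{1}{2}\left(y_{n-1} - \frac{x}{y_{n-1}}\right) = y_{n-1} - y_n$$

Hence

$$0 \le y_n - \sqrt{x} \le y_{n-1} - y_n \qquad (9)$$

Thus, if $0 \le y_{n-1} - y_n < \varepsilon$ $(n \ge 2)$ it is guaranteed that $0 \le y_n - \sqrt{x} < \varepsilon$.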
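
Hero's process with the binary starting value and the stopping rule based on estimate (9) can be sketched as follows. The function name `hero_sqrt` is ours, and `math.frexp` is used to obtain the exponent m.

```python
import math

def hero_sqrt(x, eps=1e-12):
    """sqrt(x) for x > 0 by Hero's process, stopped when y_{n-1} - y_n < eps (n >= 2)."""
    _, m = math.frexp(x)                    # x = x1 * 2**m with 0.5 <= x1 < 1
    y0 = math.ldexp(1.0, m // 2)            # y0 = 2**E(m/2); // is the floor, i.e. E(.)
    y = 0.5 * (y0 + x / y0)                 # y1, already >= sqrt(x)
    while True:
        y_prev, y = y, 0.5 * (y + x / y)
        if y_prev - y < eps:                # by (9) the true error is then below eps
            return y

print(hero_sqrt(5.0))
```

For x = 5 the first iterates are 2, 2.25, 2.2361..., as in Example 2.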

Another method for computing square roots that is sometimes useful is this. Replace the function (1) by the equivalent relation

$$F(x, y) = \frac{x}{y^2} - 1 = 0$$

Then

$$F'_y(x, y) = -\frac{2x}{y^3}$$

Using formula (4) of Sec. 3.10, we get

$$y_{n+1} = y_n + \frac{\frac{x}{y_n^2} - 1}{\frac{2x}{y_n^3}}$$

or

$$y_{n+1} = \frac{y_n}{2}\left(3 - \frac{y_n^2}{x}\right) \quad (n = 0, 1, 2, \dots) \qquad (10)$$

We will not consider the error estimate or the conditions for convergence of the iterative process (10).

3.13 COMPUTING THE RECIPROCAL OF A SQUARE ROOT
Set

$$y = \frac{1}{\sqrt{x}} \quad (x > 0)$$

Writing the function as

$$y = \sqrt{\frac{1}{x}}$$

we obtain from (10) of the preceding section an iterative process without division:

$$y_{n+1} = \frac{y_n}{2}\left(3 - x y_n^2\right) \quad (n = 0, 1, 2, \dots) \qquad (1)$$

If $x = 2^m x_1$, where $\frac{1}{2} \le x_1 < 1$, then for $y_0$ choose the value

$$y_0 = 2^{-E\left(\frac{m}{2}\right)}$$

It will be noted that by using the obvious equation

$$\sqrt{x} = x\cdot\frac{1}{\sqrt{x}}$$

it is possible, by virtue of formula (1), to extract the square root of a number without invoking the operation of division.
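
A short sketch of process (1): the function name `rsqrt` and the fixed count of eight iterations are assumptions made for illustration, and the starting value is taken as $2^{-E(m/2)}$.

```python
import math

def rsqrt(x, iterations=8):
    """1/sqrt(x) for x > 0 by y <- (y / 2) * (3 - x * y * y), with no division by x or y."""
    _, m = math.frexp(x)                  # x = x1 * 2**m with 0.5 <= x1 < 1
    y = math.ldexp(1.0, -(m // 2))        # assumed starting value 2**(-E(m/2))
    for _ in range(iterations):
        y = 0.5 * y * (3 - x * y * y)
    return y

x = 7.9
print(rsqrt(x), x * rsqrt(x))             # second value approximates sqrt(x)
```

The second printed value illustrates the closing remark: multiplying x by the computed reciprocal root yields the square root itself.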




3.14 COMPUTING CUBE ROOTS 
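If

$$y = \sqrt[3]{x} \quad (x > 0) \qquad (1)$$

then, putting

$$F(x, y) = y^3 - x = 0$$

we get

$$F'_y(x, y) = 3y^2$$

whence, using formula (4) of Sec. 3.10, we obtain

$$y_{n+1} = y_n - \frac{y_n^3 - x}{3y_n^2} \qquad (2)$$

or

$$y_{n+1} = \frac{1}{3}\left(2y_n + \frac{x}{y_n^2}\right) \qquad (3)$$

Geometrically, process (3) is Newton's method applied to the cubic parabola

$$z = y^3 - x \quad (x = \text{const})$$

(Fig. 8). The process (3) converges for $y_0 > 0$.

If for the initial approximation $y_0$ we take the tabular value of $\sqrt[3]{x}$ which has a relative error of $|\delta|$, that is, if we set

$$y_0 = \sqrt[3]{x}\,(1 + \delta)$$

then the value of $y_1$ found from formula (3) will yield $\sqrt[3]{x}$ with a relative error of $\delta^2$. Indeed, applying formula (3), we have

$$y_1 = \frac{1}{3}\left(2y_0 + \frac{x}{y_0^2}\right) = \frac{1}{3}\left[2\sqrt[3]{x}\,(1 + \delta) + \sqrt[3]{x}\,(1 + \delta)^{-2}\right] \approx \frac{1}{3}\sqrt[3]{x}\,(2 + 2\delta + 1 - 2\delta + 3\delta^2) = \sqrt[3]{x}\,(1 + \delta^2)$$

Fig. 8

From this we conclude, for one thing, that if $y_0$ is correct to p digits, in the narrow sense, then $y_1$ will be correct to 2p or 2p - 1 digits, in the broad sense (cf. Sec. 3.12).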


Example. Using three-place tables, we have

$$\sqrt[3]{10} = 2.154$$

correct to all the digits.
Using formula (3), we get

$$\sqrt[3]{10} \approx \frac{1}{3}\left(2\cdot 2.154 + \frac{10}{2.154^2}\right) = \frac{1}{3}(2\cdot 2.154 + 2.155304) = 2.154435$$

By way of comparison, Barlow's tables give

$$\sqrt[3]{10} = 2.1544347\ldots$$


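If $x = 2^m x_1$, where m is an integer and $\frac{1}{2} \le x_1 < 1$, then

$$y_0 = 2^{E\left(\frac{m}{3}\right)} \qquad (4)$$

is ordinarily taken for the initial value $y_0$.
Since

$$y_n - \sqrt[3]{x} = \frac{1}{3}\left(2y_{n-1} + \frac{x}{y_{n-1}^2}\right) - \sqrt[3]{x} = \frac{1}{3y_{n-1}^2}\left(y_{n-1} - \sqrt[3]{x}\right)^2\left(2y_{n-1} + \sqrt[3]{x}\right) \ge 0$$

it follows that

$$y_n \ge \sqrt[3]{x} \quad \text{for } n \ge 1 \qquad (5)$$

Besides, from formula (2), replacing n + 1 by n, we have

$$y_{n-1} - y_n = \frac{y_{n-1}^3 - x}{3y_{n-1}^2} \qquad (6)$$

and therefore

$$y_1 \ge y_2 \ge \dots \ge y_n \ge \dots \ge \sqrt[3]{x} \qquad (7)$$

whence it follows that the limit

$$\lim_{n \to \infty} y_n = y > 0$$

exists. Passing to the limit in (3) as $n \to \infty$, we get

$$y = \frac{1}{3}\left(2y + \frac{x}{y^2}\right)$$

That is, $y^3 = x$, hence, $y = \sqrt[3]{x}$. Thus

$$\lim_{n \to \infty} y_n = \sqrt[3]{x}$$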
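
Process (3) can be sketched in the same style as the square-root routine. The function name `cube_root`, the starting value $2^{E(m/3)}$ and the stopping test on the difference of successive approximations are assumptions of this sketch rather than a statement of the book's error estimate.

```python
import math

def cube_root(x, eps=1e-12):
    """Cube root of x > 0 by y <- (2*y + x / y**2) / 3."""
    _, m = math.frexp(x)                   # x = x1 * 2**m with 0.5 <= x1 < 1
    y = math.ldexp(1.0, m // 3)            # assumed starting value 2**E(m/3)
    y = (2 * y + x / (y * y)) / 3          # y1 is already >= the cube root
    while True:
        y_next = (2 * y + x / (y * y)) / 3
        if y - y_next < eps:               # the sequence decreases monotonically
            return y_next
        y = y_next

print(cube_root(10.0))
```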


If the initial approximation $y_0$ is chosen on the basis of formula (4), it can be proved that


377 Oe 
0 <n, — VX SF Yn- Yn) 
for n2. 




REFERENCES FOR CHAPTER 3 


[1] V. I. Smirnov, Course of Higher Mathematics, 1957, Vol. I, Chapter IV (in Russian).

[2] A. Markov, Calculus of Finite Differences, 1911, Chapter III (in Russian).

[3] G. P. Tolstov, Course of Mathematical Analysis, 1957, Vol. II, Chapter XXIV (in Russian).

[4] A. N. Khovansky, Application of Continued Fractions and Their Generalizations to Problems of Approximate Analysis, 1956, Chapter II (in Russian).

[5] B. M. Kagan and T. M. Ter-Mikaelyan, Solution of Engineering Problems on Automatic Digital Computers, 1958, Chapter III (in Russian).

[6] G. M. Fikhtengolts, Course of Differential and Integral Calculus, 1948, Vol. II, Chapter XII (in Russian).

[7] L. A. Lyusternik, A. A. Abramov, V. I. Shestakov, M. R. Shura-Bura, The Solution of Mathematical Problems on Automatic Digital Computers, 1952 (in Russian).


Chapter 4 


APPROXIMATE SOLUTIONS OF ALGEBRAIC 
AND TRANSCENDENTAL EQUATIONS


4.1 ISOLATION OF ROOTS


If an algebraic or transcendental equation is fairly complicated, it is not usually possible to find exact roots. What is more, in certain cases the equation contains coefficients known only approximately and, hence, the very aim of finding the exact roots of the equation is meaningless. Therefore, methods for approximating the roots of an equation and estimating their degree of accuracy acquire particular importance.

Suppose we have an equation 


f(x) =0 (1) 


where the function f(x) is defined and continuous on some finite 
or infinite interval a < x < b. 

In certain cases in the sequel we will need the existence and continuity of the first derivative $f'(x)$ or even the second derivative $f''(x)$. This will be appropriately stipulated where necessary.

Every value $\xi$ for which the function $f(x)$ is zero, that is, such that

$$f(\xi) = 0$$

is called a root of equation (1) or a zero of the function $f(x)$.

We will assume that equation (1) has only isolated roots, that 
is, for each root of (1) there is a neighbourhood which does not 
contain any other roots of the equation. 

Approximating the isolated real roots of (1) ordinarily consists 
of two stages: 

(1) isolating the roots, that is, establishing the smallest possible intervals $[\alpha, \beta]$ containing one and only one root of equation (1);

(2) improving the values of the approximate roots, that is, refining them to the specified degree of accuracy.

Very useful in the isolating of roots is the following familiar theorem of mathematical analysis ([5], Chapter 4).

Theorem 1. If a continuous function $f(x)$ assumes values of opposite sign at the endpoints of an interval $[\alpha, \beta]$, i.e., $f(\alpha)\,f(\beta) < 0$, then the interval will contain at least one root of the equation $f(x) = 0$; in other words, there will be at least one number $\xi \in (\alpha, \beta)$ ¹⁾ such that $f(\xi) = 0$ (Fig. 9).

The root $\xi$ will definitely be unique if the derivative $f'(x)$ exists and preserves sign within the interval $(\alpha, \beta)$; that is, if $f'(x) > 0$ (or $f'(x) < 0$) for $\alpha < x < \beta$ (Fig. 10).


Fig. 9          Fig. 10


The process of isolation of roots begins with establishing the signs of the function $f(x)$ at the endpoints $x = a$ and $x = b$ of its domain of existence.

Then the signs of the function $f(x)$ are determined at a number of intermediate points $x = \alpha_1, \alpha_2, \dots$, the choice of which takes into account the peculiarities of the function $f(x)$. If it turns out that $f(\alpha_k)\,f(\alpha_{k+1}) < 0$, then, by virtue of Theorem 1, there is a root of the equation $f(x) = 0$ in the interval $(\alpha_k, \alpha_{k+1})$. We must make sure that this root is the only one. Practically speaking, it is often sufficient, for isolation of the roots, to perform a halving process, approximately dividing the given interval $(\alpha, \beta)$ into two, four, eight, etc., equal parts (up to a certain step) and to determine the signs of $f(x)$ at the points of division. It is well to recall that an nth-degree algebraic equation

$$a_0 x^n + a_1 x^{n-1} + \dots + a_n = 0 \quad (a_0 \ne 0)$$

has at most n real roots. Therefore, if for such an equation we obtain n sign changes (that is, n + 1 points at which the signs of $f(x)$ alternate), then all the roots of the equation have been isolated.

Example 1. Isolate the roots of the equation

$$f(x) = x^3 - 6x + 2 = 0 \qquad (2)$$

1) The notation $\xi \in (\alpha, \beta)$ signifies that the point $\xi$ belongs to the interval $(\alpha, \beta)$.



Solution. Tabulate the findings roughly as follows:

    x       -∞    -3    -1    0    1    3    +∞
    f(x)    -     -     +     +    -    +    +

Thus, equation (2) has three real roots within the intervals (-3, -1), (0, 1) and (1, 3).
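
The sign tabulation is easy to mechanise. The sketch below (the helper name `sign_change_intervals` and the list of trial points are ours) reports every pair of neighbouring points at which the function takes values of opposite sign.

```python
def sign_change_intervals(f, points):
    """Return the intervals (a, b) between neighbouring points where f changes sign."""
    intervals = []
    for a, b in zip(points, points[1:]):
        if f(a) * f(b) < 0:
            intervals.append((a, b))
    return intervals

f = lambda x: x**3 - 6 * x + 2
intervals = sign_change_intervals(f, [-3, -1, 0, 1, 3])
print(intervals)   # [(-3, -1), (0, 1), (1, 3)]
```

Since a cubic has at most three real roots, three sign changes mean that each interval holds exactly one root.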
If there is a continuous derivative f’(x) and the roots of the 
equation 
P(x) =0 


can readily be computed, the process of isolation of roots of equa- 
tion (1) may be regularized. Clearly, it is then sufficient to count 
only the signs of the function f(x) at the points of the zeros of 
its derivative and at the end points x=a and x=b. 
Example 2. Isolate the roots of the equation

$$f(x) = x^4 - 4x - 1 = 0 \tag{3}$$

Solution. Here $f'(x) = 4(x^3 - 1)$ and so $f'(x) = 0$ for $x = 1$.

We have $f(-\infty) > 0$ (+); $f(1) < 0$ (-); $f(+\infty) > 0$ (+).
Consequently, equation (3) has only two real roots, one of which
lies in the interval $(-\infty, 1)$, and the other lies in the interval
$(1, +\infty)$.


Example 3. Determine the number of real roots of the equation

$$f(x) = x + e^x = 0 \tag{4}$$

Solution. Since $f'(x) = 1 + e^x > 0$ and $f(-\infty) = -\infty$, $f(+\infty) = +\infty$,
it follows that (4) has only one real root.
We give an estimate of the error of an approximate root. 


Theorem 2. Let $\xi$ be an exact root and $\bar{x}$ an approximate root of the
equation $f(x)=0$, both located in the same interval $[\alpha, \beta]$, and
$|f'(x)| \ge m_1 > 0$ for $\alpha \le x \le \beta$. 1)

Then the following estimate holds true:

$$|\bar{x} - \xi| \le \frac{|f(\bar{x})|}{m_1} \tag{5}$$

1) For $m_1$ we can, for instance, take the least value of $|f'(x)|$ when
$\alpha \le x \le \beta$.


Proof. Applying the mean-value theorem, we have

$$f(\bar{x}) - f(\xi) = (\bar{x} - \xi) f'(c)$$

where c is a value intermediate between $\bar{x}$ and $\xi$, that is $c \in (\alpha, \beta)$.
From this, because $f(\xi)=0$ and $|f'(c)| \ge m_1$, we get

$$|f(\bar{x}) - f(\xi)| = |f(\bar{x})| \ge m_1 |\bar{x} - \xi|$$

Hence

$$|\bar{x} - \xi| \le \frac{|f(\bar{x})|}{m_1}$$

Note. Formula (5) may yield rough results and so is not always
convenient to use. For this reason, in practical situations it is
common to narrow in some way the general interval $(\alpha, \beta)$ that
contains the root $\xi$ and its approximate value $\bar{x}$, and to assume
$|\bar{x} - \xi| < \beta - \alpha$.


Example 4. As an approximate root of the equation
$f(x) = x^4 - x - 1 = 0$ we have $\bar{x} = 1.22$. Estimate the absolute error in this
root.

Solution. We have $f(\bar{x}) = 2.2153 - 1.22 - 1 = -0.0047$.
Since for $x = 1.23$ we get

$$f(x) = 2.2888 - 1.23 - 1 = +0.0588$$

the exact root $\xi$ lies in the interval (1.22, 1.23). The derivative
$f'(x) = 4x^3 - 1$ increases monotonically, and so its smallest value
in the given interval is

$$m_1 = 4 \cdot 1.22^3 - 1 = 4 \cdot 1.816 - 1 = 6.264$$

whence, by formula (5), we get

$$|\bar{x} - \xi| < \frac{0.0047}{6.264} \approx 0.00075 < 0.001$$
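The estimate of Theorem 2 is easy to check numerically. The following sketch assumes the quartic of Example 4 with its derivative $f'(x) = 4x^3 - 1$; the function name `root_error_bound` is hypothetical.

```python
def root_error_bound(f, x_approx, m1):
    """Bound |x_approx - xi| <= |f(x_approx)| / m1 from formula (5),
    where m1 is a positive lower bound of |f'(x)| on the bracket."""
    return abs(f(x_approx)) / m1


def f(x):
    return x**4 - x - 1


def df(x):
    return 4 * x**3 - 1


a, b = 1.22, 1.23                 # bracket containing the root
assert f(a) * f(b) < 0            # sign change confirms a root inside
m1 = df(a)                        # f' increases, so its minimum is at a
bound = root_error_bound(f, a, m1)
print(f"m1 = {m1:.3f}, error bound = {bound:.5f}")
```

The bound stays below 0.001, so the approximate root 1.22 is correct to about three decimal places.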

Note. Occasionally, in practical situations the accuracy of an
approximate root $\bar{x}$ is estimated by how well it satisfies the given
equation $f(x)=0$; that is, if the number $|f(\bar{x})|$ is small, then $\bar{x}$ is
assumed to be a good approximation to the exact root $\xi$, but if
$|f(\bar{x})|$ is great, then $\bar{x}$ is taken to be a rough value of the exact
root $\xi$. As Figs. 11 and 12 show, this approach is erroneous. One
should likewise not forget that if the equation $f(x)=0$ is multiplied
by an arbitrary number $N \neq 0$, then we obtain an equivalent
equation $N f(x)=0$, and the number $|N f(\bar{x})|$ may be made arbitrarily
large or small by an appropriate choice of the factor N.

Fig. 11    Fig. 12


4.2 GRAPHICAL SOLUTION OF EQUATIONS 
The real roots of the equation 
f(x) =0 (1) 


can be determined approximately as the abscissas of the points 
of intersection of the graph of the function y=f (x) with the x-axis 
(Fig. 9). If equation (1) does not have nearly equal roots, this 
method can readily be used to isolate them. It is often advisable 
to replace equation (1) by an equivalent equation (two equations 
are termed equivalent if they have exactly the same roots): 


$$\varphi(x) = \psi(x) \tag{2}$$

where the functions $\varphi(x)$ and $\psi(x)$ are simpler than $f(x)$. Then
we construct the graphs of the functions $y = \varphi(x)$ and $y = \psi(x)$, and
the desired roots are obtained as the abscissas of the points of
intersection of these graphs.


Example 1. Solve the following equation graphically:

$$x \log_{10} x = 1 \tag{3}$$

Solution. Write equation (3) as

$$\log_{10} x = \frac{1}{x}$$

The roots of (3) can clearly be found as the abscissas of the
points of intersection of the logarithmic curve $y = \log_{10} x$ and the
hyperbola $y = \frac{1}{x}$. Constructing these curves (Fig. 13) on coordinate
paper, we get an approximate value of the sole root $\xi \approx 2.5$ of
equation (3).

Fig. 13


Finding the roots of equation (2) is simplified if one of the
functions $\varphi(x)$ or $\psi(x)$ is linear, say $\psi(x) = ax + b$. Then the roots
of (2) are found as the abscissas of the points of intersection of
the curve $y = \varphi(x)$ and the straight line $y = ax + b$. This device is
particularly advantageous when solving a series of equations of the
same type which differ solely in the coefficients a and b of a
linear function. Here the graphical construction reduces to
finding the points of intersection of a fixed graph, $y = \varphi(x)$,
and various straight lines. This type obviously includes
the three-term equations

$$x^n + ax + b = 0$$

Example 2. Solve the cubic equations

$$x^3 - 1.75x + 0.75 = 0$$

and

$$x^3 + 2x + 7.8 = 0$$

Solution. Construct the cubic parabola $y = x^3$. The desired
roots are found as the abscissas of the points of intersection
of this parabola with the straight lines (Fig. 14) $y = 1.75x - 0.75$
and $y = -2x - 7.8$. It is clear from the figure that
the first equation has three real roots: $x_1 = -1.5$, $x_2 = 0.5$, $x_3 = 1$,
and the second equation has only one real root, $x_1 \approx -1.65$.

Fig. 14

It should be noted that although graphical methods of solving 
equations are very convenient and comparatively simple, they are 
as a rule used only for a rough determination of the roots. With 
respect to loss of accuracy, a particularly unfavourable case is 
when the lines intersect at a very acute angle and practically merge 
along a certain arc. 

One variety of the graphical methods of solving equations are 
nomographic methods, which the reader will find in specialized 
manuals. 


4.3 THE HALVING METHOD 
Suppose we have an equation

$$f(x) = 0 \tag{1}$$

where the function $f(x)$ is continuous on $[a, b]$ and $f(a) f(b) < 0$.
In order to find a root of (1) lying in the interval $[a, b]$, divide
the interval in half. If $f\left(\frac{a+b}{2}\right) = 0$, then $\xi = \frac{a+b}{2}$ is a root of the
equation. If $f\left(\frac{a+b}{2}\right) \neq 0$, then we choose that half, $\left[a, \frac{a+b}{2}\right]$ or
$\left[\frac{a+b}{2}, b\right]$, at the endpoints of which the function $f(x)$ has opposite
signs. The newly reduced interval $[a_1, b_1]$ is again halved and
the same investigation is made, etc. Finally, at some stage in the
process we get either the exact root of (1) or an infinite sequence
of nested intervals $[a_1, b_1], [a_2, b_2], \ldots, [a_n, b_n], \ldots$ such that

$$f(a_n) f(b_n) < 0 \quad (n = 1, 2, \ldots) \tag{2}$$

and

$$b_n - a_n = \frac{1}{2^n} (b - a) \tag{3}$$

Since the left endpoints $a_1, a_2, \ldots, a_n, \ldots$ form a monotonic
nondecreasing bounded sequence, and the right endpoints $b_1, b_2, \ldots,
b_n, \ldots$ form a monotonic nonincreasing bounded sequence, then
by (3) there exists a common limit

$$\xi = \lim_{n \to \infty} a_n = \lim_{n \to \infty} b_n$$

Passing to the limit in inequality (2) as $n \to \infty$ we get, by
virtue of the continuity of the function $f(x)$, $[f(\xi)]^2 \le 0$, whence
$f(\xi) = 0$, which means that $\xi$ is a root of equation (1); it is obvious
that

$$0 \le \xi - a_n \le \frac{1}{2^n} (b - a) \tag{4}$$


If the roots of equation (1) are not isolated in the interval [a, b],
then this device can be used to find one of the roots of (1).

The method of halving is conveniently used in rough approximations
of the root of the given equation since the amount of
computation increases substantially with increase in accuracy.

The halving method, by the way, is well suited to electronic
computers. The computational routine is set up so that the machine
finds the value of the left member of equation (1) at the
midpoint of each of the intervals $[a_n, b_n]$ (n = 1, 2, ...) and chooses
the appropriate half.


Example. Use the halving method to improve a root of the equation

$$f(x) = x^4 + 2x^3 - x - 1 = 0$$

lying in the interval [0, 1].

Solution. We have successively

f(0) = -1, f(1) = 1,
f(0.5) = 0.06 + 0.25 - 0.5 - 1 = -1.19,
f(0.75) = 0.32 + 0.84 - 0.75 - 1 = -0.59,
f(0.875) = 0.59 + 1.34 - 0.88 - 1 = +0.05,
f(0.8125) = 0.436 + 1.072 - 0.812 - 1 = -0.304,
f(0.8438) = 0.507 + 1.202 - 0.844 - 1 = -0.135,
f(0.8594) = 0.546 + 1.270 - 0.859 - 1 = -0.043, etc.

We can take

$$\xi = \frac{1}{2} (0.859 + 0.875) = 0.867$$
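The halving routine described above lends itself to a short program. The sketch below is one possible arrangement, not the book's own routine; the name `bisect` and the stopping tolerance `eps` are assumptions.

```python
def bisect(f, a, b, eps=1e-6, max_iter=200):
    """Halving method: f must be continuous on [a, b] with f(a)*f(b) < 0.
    Stops when half the current interval length is below eps."""
    fa, fb = f(a), f(b)
    if fa * fb >= 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    for _ in range(max_iter):
        mid = 0.5 * (a + b)
        fm = f(mid)
        if fm == 0 or 0.5 * (b - a) < eps:
            return mid
        # Keep the half at whose endpoints f has opposite signs
        if fa * fm < 0:
            b, fb = mid, fm
        else:
            a, fa = mid, fm
    return 0.5 * (a + b)


def f(x):
    return x**4 + 2 * x**3 - x - 1


root = bisect(f, 0.0, 1.0)
print(f"root = {root:.6f}")
```

By estimate (4), each pass halves the error bound, so reaching a tolerance of $10^{-6}$ on an interval of unit length takes about 20 halvings.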


4.4 THE METHOD OF PROPORTIONAL PARTS
(METHOD OF CHORDS)


Using the assumptions of Sec. 4.3, we give a faster method for
finding a root $\xi$ of the equation $f(x) = 0$ lying in a specified
interval [a, b] such that $f(a) f(b) < 0$.

For the sake of definiteness, let $f(a) < 0$ and $f(b) > 0$. Then
instead of halving the interval [a, b] it is more natural to divide
it in the ratio $-f(a) : f(b)$. This yields an approximate value of the
root

$$x_1 = a + h_1 \tag{1}$$

where

$$h_1 = \frac{-f(a)}{-f(a) + f(b)} (b - a) = -\frac{f(a)}{f(b) - f(a)} (b - a) \tag{2}$$

Then, applying this device to the interval $[a, x_1]$ or $[x_1, b]$ at
the endpoints of which the function $f(x)$ has opposite signs, we get a
second approximation to the root, $x_2$, etc.

Geometrically, the method of proportional parts is equivalent to
replacing the curve $y = f(x)$ by a chord that passes through the
points $A[a, f(a)]$ and $B[b, f(b)]$ (Fig. 15). Indeed, the equation
of the chord AB is

$$\frac{x - a}{b - a} = \frac{y - f(a)}{f(b) - f(a)}$$

Whence, assuming $x = x_1$ and $y = 0$, we get

$$x_1 = a - \frac{f(a)}{f(b) - f(a)} (b - a) \tag{1'}$$

Fig. 15

Formula (1') is completely equivalent to formulas (1) and (2).

To prove the convergence of the process, let us assume that the
root is isolated and the second derivative $f''(x)$ has a constant sign
on the interval [a, b].

Suppose, for definiteness, that $f''(x) > 0$ for $a \le x \le b$ (the case
$f''(x) < 0$ reduces to our case if we write the equation as
$-f(x) = 0$). Then the curve $y = f(x)$ will be convex down and, hence,
will be located below its chord AB. Two cases are possible: (1)
$f(a) > 0$ (Fig. 16) and (2) $f(a) < 0$ (Fig. 17).

In the former case, the endpoint a is fixed and the successive
approximations: $x_0 = b$,

$$x_{n+1} = x_n - \frac{f(x_n)}{f(x_n) - f(a)} (x_n - a) \quad (n = 0, 1, 2, \ldots) \tag{3}$$

form a bounded decreasing monotonic sequence, and

$$a < \xi < \ldots < x_{n+1} < x_n < \ldots < x_1 < x_0$$

In the latter case, the endpoint b is fixed and the successive
approximations: $x_0 = a$,

$$x_{n+1} = x_n - \frac{f(x_n)}{f(b) - f(x_n)} (b - x_n) \tag{4}$$

form a bounded increasing monotonic sequence, and

$$x_0 < x_1 < x_2 < \ldots < x_n < x_{n+1} < \ldots < \xi < b$$


Summarizing, we conclude that: (1) the fixed endpoint is that
one for which the sign of the function $f(x)$ coincides with the sign
of its second derivative $f''(x)$; (2) the successive approximations $x_n$
lie on the side of the root $\xi$ where the sign of the function $f(x)$
is opposite to the sign of its second derivative $f''(x)$. In both cases,
each successive approximation $x_{n+1}$ is closer to the root $\xi$ than the
preceding one, $x_n$. Suppose

$$\bar{\xi} = \lim_{n \to \infty} x_n \quad (a < \bar{\xi} < b)$$

(the limit exists since the sequence $\{x_n\}$ is bounded and monotonic).
Passing to the limit in (3), we have for the first case

$$\bar{\xi} = \bar{\xi} - \frac{f(\bar{\xi})}{f(\bar{\xi}) - f(a)} (\bar{\xi} - a)$$

whence $f(\bar{\xi}) = 0$. Since it is given that the equation $f(x) = 0$ has
only one root $\xi$ on the interval (a, b), it follows that $\bar{\xi} = \xi$, which
completes the proof.

By means of the very same limit process in (4), it can be
proved that $\bar{\xi} = \xi$ for the second case.


For an estimate of the accuracy of the approximation, we can
use formula (5) of Sec. 4.1:

$$|x_n - \xi| \le \frac{|f(x_n)|}{m_1}$$

where $|f'(x)| \ge m_1$ for $a \le x \le b$.

We give another formula which permits estimating the absolute
error in an approximate value $x_n$ if two successive values $x_{n-1}$
and $x_n$ are known.

We will assume that the derivative $f'(x)$ is continuous on the
interval [a, b] containing all the approximations and preserves sign;
note that

$$0 < m_1 \le |f'(x)| \le M_1 < +\infty \tag{5}$$

For the sake of definiteness, let us assume that the successive
approximations $x_n$ to the exact root $\xi$ are generated by formula (3)
[a consideration of formula (4) is similar]:

$$x_n = x_{n-1} - \frac{f(x_{n-1})}{f(x_{n-1}) - f(a)} (x_{n-1} - a) \quad (n = 1, 2, \ldots)$$

where the endpoint a is fixed. Then, taking note of
the fact that $f(\xi) = 0$, we get

$$f(\xi) - f(x_{n-1}) = \frac{f(x_{n-1}) - f(a)}{x_{n-1} - a} (x_n - x_{n-1})$$

Using the mean-value theorem, we get

$$(\xi - x_{n-1}) f'(\xi_{n-1}) = (x_n - x_{n-1}) f'(\bar{x}_{n-1})$$

where $\xi_{n-1} \in (x_{n-1}, \xi)$ and $\bar{x}_{n-1} \in (a, x_{n-1})$. Hence

$$|\xi - x_n| = \frac{|f'(\bar{x}_{n-1}) - f'(\xi_{n-1})|}{|f'(\xi_{n-1})|} |x_n - x_{n-1}| \tag{6}$$

Since $f'(x)$ has constant sign over the interval [a, b] and $\bar{x}_{n-1} \in [a, b]$
and $\xi_{n-1} \in [a, b]$, we plainly obtain

$$|f'(\bar{x}_{n-1}) - f'(\xi_{n-1})| \le M_1 - m_1$$

We therefore derive from formula (6)

$$|\xi - x_n| \le \frac{M_1 - m_1}{m_1} |x_n - x_{n-1}| \tag{7}$$

where for $m_1$ and $M_1$ we can take, respectively, the smallest and
largest values of the modulus of the derivative $f'(x)$ on the interval
[a, b]. If the interval [a, b] is so narrow that the inequality

$$M_1 \le 2 m_1$$

holds, then from formula (7) we obtain

$$|\xi - x_n| \le |x_n - x_{n-1}|$$

Thus, in this case, as soon as it is clear that

$$|x_n - x_{n-1}| < \varepsilon$$

where $\varepsilon$ is the specified limiting absolute error, then it is guaranteed
that

$$|\xi - x_n| < \varepsilon$$

Example. Find a positive root of the equation

$$f(x) = x^3 - 0.2x^2 - 0.2x - 1.2 = 0$$

to an accuracy of 0.002.

Solution. First of all isolate the root. Since

f(1) = -0.6 < 0 and f(2) = 5.6 > 0

the desired root $\xi$ lies in the interval (1, 2). The resulting interval
is great and so we halve it. Since

f(1.5) = 1.425

it follows that $1 < \xi < 1.5$.
Applying formulas (1) and (2) in succession, we get

$$x_1 = 1 + \frac{0.6}{1.425 + 0.6} (1.5 - 1) = 1 + 0.15 = 1.15, \quad f(x_1) = -0.173,$$

$$x_2 = 1.15 + \frac{0.173}{1.425 + 0.173} (1.5 - 1.15) \approx 1.15 + 0.040 = 1.190, \quad f(x_2) = -0.036,$$

$$x_3 = 1.190 + \frac{0.036}{1.425 + 0.036} (1.5 - 1.190) = 1.190 + 0.008 = 1.198, \quad f(x_3) = -0.0072$$

Since $f'(x) = 3x^2 - 0.4x - 0.2$ and for $x_3 < x < 1.5$ we have

$$f'(x) \ge 3 \cdot 1.198^2 - 0.4 \cdot 1.5 - 0.2 = 3 \cdot 1.43 - 0.8 = 3.49$$

we can take it that

$$0 < \xi - x_3 < \frac{0.0072}{3.49} \approx 0.002$$

Thus, $\xi = 1.198 + 0.002\theta$ where $0 < \theta \le 1$. Note that the exact
root of the equation is $\xi = 1.2$.
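The computation in this example can be reproduced with a short sketch of the method of chords. The function name `chord_method`, the stopping rule based on formula (5) of Sec. 4.1, and the choice $m_1 = f'(1) = 2.4$ (the least value of $f'$ on [1, 1.5], since $f''(x) > 0$ there) are assumptions made for illustration.

```python
def chord_method(f, a, b, m1, eps, max_iter=100):
    """Method of proportional parts on [a, b] with f(a)*f(b) < 0.
    Stops once the a-posteriori bound |f(x)|/m1 does not exceed eps."""
    fa, fb = f(a), f(b)
    if fa * fb >= 0:
        raise ValueError("f(a) and f(b) must have opposite signs")
    x = a
    for _ in range(max_iter):
        # Abscissa where the chord through (a, fa), (b, fb) meets y = 0
        x = a - fa * (b - a) / (fb - fa)
        fx = f(x)
        if abs(fx) / m1 <= eps:
            return x
        # Keep the subinterval at whose endpoints f changes sign
        if fa * fx < 0:
            b, fb = x, fx
        else:
            a, fa = x, fx
    return x


def f(x):
    return x**3 - 0.2 * x**2 - 0.2 * x - 1.2


root = chord_method(f, 1.0, 1.5, m1=2.4, eps=0.002)
print(f"root = {root:.4f}")
```

Because the curve is convex down on [1, 1.5] and $f(1.5) > 0$, the endpoint 1.5 stays fixed throughout, in agreement with the rule stated in the text.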


4.5 NEWTON'S METHOD (METHOD OF TANGENTS)


Suppose the root $\xi$ of the equation

$$f(x) = 0 \tag{1}$$

is isolated on the interval [a, b] and $f'(x)$ and $f''(x)$ are continuous
and preserve signs for $a \le x \le b$. Having found an nth approximation
of the root, $x_n \approx \xi$ ($a \le x_n \le b$), we can improve it by Newton's
method in the following manner.
Set

$$\xi = x_n + h_n \tag{2}$$

where $h_n$ is a small quantity. Then, applying Taylor's formula, we
have

$$0 = f(x_n + h_n) \approx f(x_n) + h_n f'(x_n)$$

Consequently,

$$h_n = -\frac{f(x_n)}{f'(x_n)}$$

Inserting this correction into formula (2), we get the next
approximation of the root:

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} \quad (n = 0, 1, 2, \ldots) \tag{3}$$


Geometrically, Newton's method is equivalent to replacing a small
arc of the curve $y = f(x)$ by a tangent line drawn to a point of
the curve. Indeed, suppose, for definiteness, that $f''(x) > 0$ for
$a \le x \le b$ and $f(b) > 0$ (Fig. 18).

Fig. 18

Choose, say, $x_0 = b$ for which $f(x_0) f''(x_0) > 0$. Draw the tangent
line to the curve $y = f(x)$ at the point $B_0[x_0, f(x_0)]$. For the first
approximation $x_1$ of the root $\xi$ let us take the abscissa of the point
of intersection of this tangent with the x-axis. Through the point
$B_1[x_1, f(x_1)]$ again draw a tangent line whose abscissa of the
intersection point with the x-axis yields a second approximation $x_2$
of the root $\xi$, and so on (Fig. 18). It is plain that the equation
of the tangent at the point $B_n[x_n, f(x_n)]$ (n = 0, 1, 2, ...) is

$$y - f(x_n) = f'(x_n)(x - x_n)$$

Putting $y = 0$, $x = x_{n+1}$, we get formula (3):

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)}$$





Note that if in our case we put $x_0 = a$ and hence $f(x_0) f''(x_0) < 0$,
then drawing the tangent to the curve $y = f(x)$ at the point $A[a, f(a)]$,
we would get a point $x_1'$ (Fig. 18) lying outside the interval [a, b];
in other words, Newton's method is impractical for such a choice
of the initial value. Thus, in the given case, a "good" initial
approximation $x_0$ is one for which the inequality

$$f(x_0) f''(x_0) > 0 \tag{4}$$

is valid. We will now prove that this rule is general.

Theorem. If $f(a) f(b) < 0$, and $f'(x)$ and $f''(x)$ are nonzero and
preserve signs over $a \le x \le b$, then, proceeding from the initial
approximation $x_0 \in [a, b]$ which satisfies (4), it is possible, by using
Newton's method [formula (3)], to compute the sole root $\xi$ of equation (1)
to any degree of accuracy.

Proof. For example, suppose $f(a) < 0$, $f(b) > 0$, $f'(x) > 0$,
$f''(x) > 0$ for $a \le x \le b$ (the other cases are considered similarly).
By inequality (4) we have $f(x_0) > 0$ (we can, say, take $x_0 = b$).

By mathematical induction we prove that all approximations
$x_n > \xi$ (n = 0, 1, 2, ...) and, hence, $f(x_n) > 0$. Indeed, first of all,
$x_0 > \xi$.

Now let $x_n > \xi$. Put

$$\xi = x_n + (\xi - x_n)$$

Using Taylor's formula, we get

$$0 = f(\xi) = f(x_n) + f'(x_n)(\xi - x_n) + \frac{1}{2} f''(c_n)(\xi - x_n)^2 \tag{5}$$

where $\xi < c_n < x_n$.
Since $f''(x) > 0$, we have

$$f(x_n) + f'(x_n)(\xi - x_n) < 0$$

and, hence,

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} > \xi$$

which is what we set out to prove.





Taking into consideration the signs of $f(x_n)$ and $f'(x_n)$ we have,
from formula (3), $x_{n+1} < x_n$ (n = 0, 1, ...), that is to say, the
successive approximations $x_0, x_1, \ldots, x_n, \ldots$ form a bounded
decreasing monotonic sequence. Therefore, the limit $\bar{\xi} = \lim_{n \to \infty} x_n$ exists.

Passing to the limit in (3), we have

$$\bar{\xi} = \bar{\xi} - \frac{f(\bar{\xi})}{f'(\bar{\xi})}$$

or $f(\bar{\xi}) = 0$, whence $\bar{\xi} = \xi$, and the proof is complete.

For this reason, when applying Newton's method, one should
be guided by the following rule: for the initial point $x_0$ choose the
end of the interval (a, b) associated with an ordinate of the same
sign as the sign of $f''(x)$.

Note 1. If: (1) the function $f(x)$ is defined and continuous over
$-\infty < x < +\infty$; (2) $f(a) f(b) < 0$; (3) $f'(x) \neq 0$ for $a \le x \le b$;
(4) $f''(x)$ exists everywhere and preserves sign, then any value
$c \in [a, b]$ may be taken for the initial approximation $x_0$ when using
Newton's method to find a root of the equation $f(x) = 0$ lying in
the interval (a, b). One can, for instance, take $x_0 = a$ or $x_0 = b$.

Indeed, suppose, say, $f'(x) > 0$ for $a \le x \le b$, $f''(x) > 0$ and $x_0 = c$,
where $a \le c \le b$. If $f(c) = 0$, then the root $\xi = c$ and the problem
is solved. If $f(c) > 0$, then the foregoing reasoning holds true and
the Newton process with initial value c converges to the root
$\xi \in (a, b)$.

Finally, if $f(c) < 0$, then we find the value

$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)} = c - \frac{f(c)}{f'(c)} > c$$

Using Taylor's formula, we have

$$f(x_1) = f(c) - \frac{f(c)}{f'(c)} f'(c) + \frac{1}{2} \left[\frac{f(c)}{f'(c)}\right]^2 f''(\bar{c}) = \frac{1}{2} \left[\frac{f(c)}{f'(c)}\right]^2 f''(\bar{c}) > 0$$

where $\bar{c}$ is a certain value intermediate between c and $x_1$.
Thus

$$f(x_1) f''(x_1) > 0$$

Besides, from the condition $f''(x) > 0$ it follows that $f'(x)$ is an
increasing function and, hence, $f'(x) > f'(a) > 0$ for $x > a$. It is
thus possible to take $x_1$ for the initial value of the Newton process
converging to some root $\bar{\xi}$ of the function $f(x)$ such that
$\bar{\xi} > c \ge a$. Since, because the derivative $f'(x)$ is positive when
$x > a$, the function $f(x)$ has a unique root in the interval $(a, +\infty)$,
it follows that

$$\bar{\xi} = \xi \in (a, b)$$

A similar argument can be devised for other combinations of signs
of the derivatives $f'(x)$ and $f''(x)$.


Note 2. From formula (3) it is clear that the larger the numerical
value of the derivative $f'(x)$ in the neighbourhood of the given root,
the smaller the correction that has to be added to the nth
approximation in order to obtain the (n+1)th approximation. Newton's
method is therefore particularly convenient when the graph of the
function is steep in the neighbourhood of the given root. But if
the numerical value of the derivative $f'(x)$ is small near the root,
then the corrections will be great, and computing the root by
this method may prove to be an extended procedure or sometimes
even impossible. To summarize then: do not use Newton's method
to solve an equation $f(x) = 0$ if the curve $y = f(x)$ near the point
of intersection with the x-axis is nearly horizontal.

To estimate the error of the nth approximation $x_n$, one can use
the general formula (5) of Sec. 4.1:

$$|\xi - x_n| \le \frac{|f(x_n)|}{m_1} \tag{6}$$

where $m_1$ is the smallest value of $|f'(x)|$ in the interval [a, b].

We now derive another formula for estimating the accuracy of
the approximation $x_n$. Applying Taylor's formula, we have

$$f(x_n) = f[x_{n-1} + (x_n - x_{n-1})] = f(x_{n-1}) + f'(x_{n-1})(x_n - x_{n-1}) + \frac{1}{2} f''(\xi_{n-1})(x_n - x_{n-1})^2 \tag{7}$$

where $\xi_{n-1} \in (x_{n-1}, x_n)$. Since, by virtue of the definition of the
approximation $x_n$, we have

$$f(x_{n-1}) + f'(x_{n-1})(x_n - x_{n-1}) = 0$$

it follows, from (7), that

$$|f(x_n)| \le \frac{1}{2} M_2 (x_n - x_{n-1})^2$$

where $M_2$ is the largest value of $|f''(x)|$ on the interval [a, b].
Consequently, on the basis of formula (6) we finally get

$$|\xi - x_n| \le \frac{M_2}{2 m_1} (x_n - x_{n-1})^2 \tag{8}$$

If the Newton process converges, then $x_n - x_{n-1} \to 0$ as $n \to \infty$.
And so for $n \ge N$ we have

$$|\xi - x_n| \le |x_n - x_{n-1}|$$

that is, the "stabilized" initial decimal places of the approximations
$x_{n-1}$ and $x_n$ are correct beginning with a certain approximation.


Note that in the general case a coincidence, up to $\varepsilon$, of two
successive approximations $x_{n-1}$ and $x_n$ does not in the least
guarantee that the value of $x_n$ and the exact root $\xi$ (Fig. 19) coincide
to within the same degree of accuracy.

Fig. 19

We will also derive a formula that connects the absolute errors in two
successive approximations $x_n$ and $x_{n+1}$. From (5) we have

$$\xi = x_n - \frac{f(x_n)}{f'(x_n)} - \frac{1}{2} \cdot \frac{f''(c_n)}{f'(x_n)} (\xi - x_n)^2$$

where $c_n \in (x_n, \xi)$, whence, taking into account (3), we have

$$\xi - x_{n+1} = -\frac{1}{2} \cdot \frac{f''(c_n)}{f'(x_n)} (\xi - x_n)^2$$

and, consequently,

$$|\xi - x_{n+1}| \le \frac{M_2}{2 m_1} (\xi - x_n)^2 \tag{9}$$

Formula (9) ensures a rapid convergence of the Newton process if
the initial approximation $x_0$ is such that

$$\frac{M_2}{2 m_1} |\xi - x_0| \le q < 1$$

In particular, if

$$\mu = \frac{M_2}{2 m_1} \le 1 \quad \text{and} \quad |\xi - x_n| < 10^{-m}$$

then from (9) we get

$$|\xi - x_{n+1}| < 10^{-2m}$$

That is, in this case if the approximation $x_n$ is correct to m decimal
places, the next approximation $x_{n+1}$ will be correct to at
least 2m places; in other words, if $\mu \le 1$, then Newton's method
ensures a doubling of correct decimal places of the desired root $\xi$
at every step.


Example 1. Using Newton's method, compute a negative root of
the equation $f(x) = x^4 - 3x^2 + 75x - 10{,}000 = 0$ correct to five
places.

Solution. Setting x = 0, -10, -100, ..., in the left member,
we get f(0) = -10,000, f(-10) = -1050, f(-100) ≈ +10^8.

Thus, the desired root $\xi$ lies in the interval $-100 < \xi < -10$.
Reduce the interval found. Since f(-11) = 3453, then $-11 < \xi < -10$.
In this latter interval, $f'(x) < 0$ and $f''(x) > 0$. Since
$f(-11) > 0$ and $f''(-11) > 0$, we can take $x_0 = -11$ for the initial
approximation. The successive approximations $x_n$ (n = 1, 2, ...)
are computed in accord with the following scheme:

Stopping with n = 3, verify the sign of the value $f(x_3 + 0.001) = f(-10.260)$.
Since $f(-10.260) < 0$, it follows that $-10.261 < \xi < -10.260$
and either of these numbers yields the desired
approximation.
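The Newton process for this example, together with the rule for choosing the initial point, can be sketched as follows. The function name `newton`, the step-size stopping test, and the tolerance are assumptions of this sketch; as the text warns, a small step is not in general a guarantee of accuracy, so the final bracket check mirrors the sign test used in the solution.

```python
def newton(f, df, x0, eps=1e-10, max_iter=50):
    """Newton's method, formula (3): x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        slope = df(x)
        if slope == 0:
            raise ZeroDivisionError("f'(x) vanished; tangent is horizontal")
        x_new = x - f(x) / slope
        if abs(x_new - x) < eps:
            return x_new
        x = x_new
    raise RuntimeError("Newton process did not converge")


def f(x):
    return x**4 - 3 * x**2 + 75 * x - 10000


def df(x):
    return 4 * x**3 - 6 * x + 75


def d2f(x):
    return 12 * x**2 - 6


a, b = -11.0, -10.0
# Rule (4): start from the endpoint where f and f'' have the same sign
x0 = a if f(a) * d2f(a) > 0 else b
root = newton(f, df, x0)
print(f"x0 = {x0}, root = {root:.5f}")
```

The check `f(-10.261) > 0 > f(-10.260)` reproduces the bracket stated at the end of the solution.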


Example 2. Use Newton's method to find the smallest positive
root of the equation tan x = x to within 0.0001.

Solution. Plotting the graphs of the curves y = tan x and y = x
(Fig. 20), we conclude that the desired root $\xi$ lies in the interval
$\pi < \xi < \frac{3\pi}{2}$. Rewriting the equation in the form

$$f(x) = \sin x - x \cos x = 0$$

we have

$$f'(x) = x \sin x, \quad f''(x) = \sin x + x \cos x$$

Whence $f'(x) < 0$ and $f''(x) < 0$ for $\pi < x < \frac{3\pi}{2}$. Since
$f\left(\frac{3\pi}{2}\right) = -1$, for the initial approximation we can take $x_0 = \frac{3\pi}{2}$.
Perform the computations according to the following scheme.

n | x_n                   | f(x_n)   | f'(x_n) | h_n = -f(x_n)/f'(x_n)
0 | 3π/2 = 4.71239 (270°) | -1       | -4.712  | -0.212 (≈ -12°10')
1 | 4.50004 (257°50')     | -0.0291  | -4.399  | -0.0066 (≈ -22'44")
2 | 4.49343 (257°27'16")  | -0.00003 |         |

Fig. 20


In estimating the error in the approximate value $x_n$ note that
the successive approximations $x_n$ (n = 0, 1, 2, ...) monotonically
decrease due to the negativity of the second derivative $f''(x)$, and
$f(x_n) < 0$. For this reason, we can take $\bar{x}_n < \xi < x_n$, where $\bar{x}_n$ is
a value from the interval $\left(\pi, \frac{3\pi}{2}\right)$ such that $f(\bar{x}_n) > 0$. The value
of $\bar{x}_n$ can easily be found by inspection. [True, one could take
$\bar{x}_n = \pi$, but this is not advantageous because $f'(\pi) = 0$.] Thus, for
instance, for n = 2 and assuming approximately

$$\bar{x}_2 = 4.49340 = \text{arc } 257°27'12''$$

we have

$$f(\bar{x}_2) = \sin 257°27'12'' - 4.49340 \cdot \cos 257°27'12'' = -0.97612 + 4.49340 \cdot 0.21724 = -0.97612 + 0.97614 = +0.00002$$

Hence $\bar{x}_2$ is chosen correctly and

$$4.49340 < \xi < 4.49343$$

We can put

$$\xi = 4.4934$$

which is correct to all digits written.
The estimate of the error in the value of $x_2$ can readily be made
more exact. Since, for $x \in [\bar{x}_2, x_2]$, the derivative $f'(x)$ decreases
and $f'(x) < 0$, then

$$m_1 = \min |f'(x)| = |f'(\bar{x}_2)|$$

whence

$$m_1 = 4.49340 \cdot 0.97612 > 4$$

and, consequently,

$$|x_2 - \xi| \le \frac{|f(x_2)|}{m_1} < \frac{0.00003}{4} < 10^{-5}$$

Thus

$$\xi = 4.49343 - 0.00001\,\theta$$

where $0 < \theta \le 1$.
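The same computation can be carried out in radians without the degree conversions. The sketch below performs two Newton steps from $x_0 = 3\pi/2$ and then brackets the root from below, as in the error estimate above; the offset `1e-4` used to pick the lower point is an assumption of this sketch.

```python
import math


def f(x):
    return math.sin(x) - x * math.cos(x)


def df(x):
    return x * math.sin(x)


x = 1.5 * math.pi                  # x0 = 3*pi/2, where f and f'' are both negative
iterates = [x]
for _ in range(2):                 # two Newton steps, as in the scheme
    x = x - f(x) / df(x)
    iterates.append(x)

x_low = x - 1e-4                   # assumed lower point chosen "by inspection"
bracketed = f(x_low) > 0 > f(x)    # sign change confirms x_low < xi < x
m1 = abs(df(x_low))                # |f'| is least at the left end of [x_low, x]
bound = abs(f(x)) / m1             # formula (5) of Sec. 4.1
print(iterates, bracketed, bound)
```

Working in full floating-point precision, the second iterate differs slightly from the tabulated 4.49343, which was obtained with rounded trigonometric tables.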
Example 3. Consider the equation

$$f(x) = 0 \tag{10}$$

where $f''(x)$ is continuous and preserves sign over $-\infty < x < +\infty$.
By Rolle's theorem, equation (10) cannot have more than two
real roots. Let us note two cases of practical interest.

I. Suppose

$$f(x_0) f'(x_0) < 0, \quad f(x_0) f''(x) < 0$$

(Fig. 21).

Then (10) has the unique root $\xi$ in the interval $(x_0, x_1)$, where

$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}$$

Fig. 21

The root $\xi$ may be computed to the given accuracy by Newton's
method.

II. Let

$$f'(x_0) = 0, \quad f(x_0) f''(x) < 0$$

Then equation (10) has two roots $\xi$ and $\xi'$ in the interval
$(-\infty, +\infty)$ (Fig. 22).

Fig. 22    Fig. 23

Transforming the left member of (10) by Taylor's formula, we
have approximately

$$f(x_0) + f'(x_0)(x - x_0) + \frac{1}{2} f''(x_0)(x - x_0)^2 = 0$$

or

$$f(x_0) + \frac{1}{2} f''(x_0)(x - x_0)^2 = 0$$

whence for the roots $\xi$ and $\xi'$ we get the initial approximations

$$x_1 = x_0 - \sqrt{-\frac{2 f(x_0)}{f''(x_0)}}$$

and

$$x_1' = x_0 + \sqrt{-\frac{2 f(x_0)}{f''(x_0)}}$$

which are the abscissas of the points of intersection of the parabola

$$Y = f(x_0) + \frac{1}{2} f''(x_0)(x - x_0)^2$$

with the x-axis (Fig. 23). Subsequent improvements in the roots
can be carried out by the usual Newton method.

Assertions I and II are geometrically obvious. It is left to the
reader to carry out a rigorous proof.
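Case II can be illustrated with a small sketch. The test function $f(x) = \cosh x - 2$ with $x_0 = 0$ is a hypothetical choice made here (it satisfies $f'(x_0) = 0$ and $f(x_0) f''(x) < 0$); it does not come from the text, and the helper names are likewise assumptions.

```python
import math


def parabola_starts(x0, f0, d2f0):
    """Roots of f0 + 0.5*d2f0*(x - x0)**2 = 0, used as initial
    approximations; requires f0 and d2f0 to have opposite signs."""
    if f0 * d2f0 >= 0:
        raise ValueError("f(x0) and f''(x0) must have opposite signs")
    step = math.sqrt(-2.0 * f0 / d2f0)
    return x0 - step, x0 + step


def newton(f, df, x, eps=1e-12, max_iter=50):
    """Usual Newton process used to improve an initial approximation."""
    for _ in range(max_iter):
        x_new = x - f(x) / df(x)
        if abs(x_new - x) < eps:
            return x_new
        x = x_new
    raise RuntimeError("Newton process did not converge")


def f(x):
    return math.cosh(x) - 2.0      # hypothetical example function


df = math.sinh                     # f'(x)
d2f = math.cosh                    # f''(x) > 0 everywhere

x0 = 0.0                           # f'(0) = 0 and f(0) = -1 < 0
left0, right0 = parabola_starts(x0, f(x0), d2f(x0))
left, right = newton(f, df, left0), newton(f, df, right0)
print(left0, right0, left, right)
```

For this function the parabola gives the starting values $\pm\sqrt{2}$, and Newton's method then converges to the two roots $\pm\operatorname{arcosh} 2$.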










4.6 MODIFIED NEWTON METHOD 


If the derivative $f'(x)$ varies but slightly on the interval [a, b],
then in formula (3) of the preceding section we can put

$$f'(x_n) \approx f'(x_0) \tag{1}$$

From this, for the root $\xi$ of the equation $f(x) = 0$ we get the
successive approximations

$$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_0)} \quad (n = 0, 1, \ldots) \tag{2}$$

Geometrically, this method signifies that we replace the tangents
at the points $B_n[x_n, f(x_n)]$ by straight lines parallel to the tangent
to the curve $y = f(x)$ at its fixed point $B_0[x_0, f(x_0)]$ (Fig. 24).

Fig. 24

Formula (1) relieves us of the necessity to compute the values
of the derivative $f'(x_n)$ each time; and so this formula is very useful
if $f'(x_n)$ is complicated. It can be proved that on the assumption
of constancy of signs of the derivatives $f'(x)$ and $f''(x)$ the successive
approximations (2) yield a convergent process.
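A minimal sketch of the modified process (2) follows. It reuses the cubic from the example of Sec. 4.4 with $x_0 = 1.5$, where $f(x_0) f''(x_0) > 0$; the function name `modified_newton` and the stopping tolerance are assumptions.

```python
def modified_newton(f, df, x0, eps=1e-8, max_iter=200):
    """Modified Newton method: the slope f'(x0) is computed once
    and reused at every step, x_{n+1} = x_n - f(x_n)/f'(x0)."""
    slope = df(x0)
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / slope
        if abs(x_new - x) < eps:
            return x_new
        x = x_new
    raise RuntimeError("process did not converge")


def f(x):
    return x**3 - 0.2 * x**2 - 0.2 * x - 1.2


def df(x):
    return 3 * x**2 - 0.4 * x - 0.2


root = modified_newton(f, df, 1.5)
print(f"root = {root:.6f}")
```

The price of the fixed slope is linear rather than quadratic convergence, so more steps are needed than with formula (3) of Sec. 4.5, but each step costs only one evaluation of $f$.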


4.7 COMBINATION METHOD 


Suppose $f(a) f(b) < 0$ and $f'(x)$ and $f''(x)$ preserve signs on the
interval [a, b]. Combining the method of proportional parts and
Newton's method, we obtain a method, at each stage of which
we find minor (too small) and major (too large) approximations
to the exact root $\xi$ of the equation $f(x) = 0$.

A consequence of this is that the digits common to $x_n$ and $\bar{x}_n$
must definitely belong to the exact root $\xi$. There are four
theoretically possible cases here:

(1) $f'(x) > 0$, $f''(x) > 0$ (Fig. 25),
(2) $f'(x) > 0$, $f''(x) < 0$ (Fig. 26),
(3) $f'(x) < 0$, $f''(x) > 0$ (Fig. 27),
(4) $f'(x) < 0$, $f''(x) < 0$ (Fig. 28).

Fig. 25    Fig. 26    Fig. 27    Fig. 28

We confine our analysis to the first case. The remaining cases
are studied in similar fashion and the character of the computations
is easy to understand on the basis of the figures. It is worth
noting that these cases can be reduced to the first one if we
replace the equation $f(x) = 0$ by the equivalent equations $-f(x) = 0$
or $\pm f(-z) = 0$, where $z = -x$.

Thus, suppose $f'(x) > 0$ and $f''(x) > 0$ for $a \le x \le b$. Put $x_0 = a$,
$\bar{x}_0 = b$ and

$$x_{n+1} = x_n - \frac{f(x_n)}{f(\bar{x}_n) - f(x_n)} (\bar{x}_n - x_n) \tag{1}$$

$$\bar{x}_{n+1} = \bar{x}_n - \frac{f(\bar{x}_n)}{f'(\bar{x}_n)} \quad (n = 0, 1, 2, \ldots) \tag{1'}$$

(At each step the method of chords is applied to a new interval
$[x_n, \bar{x}_n]$.)

From what was proved above (Secs. 4.4 and 4.5) it follows that

$$x_n < \xi < \bar{x}_n$$

and

$$0 < \xi - x_n < \bar{x}_n - x_n \tag{2}$$

If the permissible absolute error in an approximate root $x_n$ is
specified beforehand and is equal to $\varepsilon$, then the process of approach
terminates as soon as we see that $\bar{x}_n - x_n < \varepsilon$. At the termination
of the process, it is best, for the value of the root $\xi$, to take the
arithmetic mean of the last values obtained:

$$\bar{\xi} = \frac{1}{2} (x_n + \bar{x}_n)$$

Example. Compute to within 0.0005 the sole positive root of the
equation

$$f(x) = x^5 - x - 0.2 = 0$$

Solution. Since $f(1) < 0$ and $f(1.1) > 0$, the root lies in the
interval (1, 1.1). We have

$$f'(x) = 5x^4 - 1 \quad \text{and} \quad f''(x) = 20x^3$$

In the chosen interval, $f'(x) > 0$, $f''(x) > 0$, which means the
signs of the derivatives are preserved.
Let us apply the combination method assuming $x_0 = 1$ and
$\bar{x}_0 = 1.1$. Since

$$f(x_0) = f(1) = -0.2, \quad f(\bar{x}_0) = f(1.1) = 0.3105, \quad f'(\bar{x}_0) = f'(1.1) = 6.3205$$

the formulas (1) and (1') yield

$$x_1 = 1 + \frac{0.1 \cdot 0.2}{0.5105} \approx 1.039, \quad \bar{x}_1 = 1.1 - \frac{0.31051}{6.3205} \approx 1.051$$

Since $\bar{x}_1 - x_1 = 0.012$, the accuracy is not sufficient. We find the
next pair of approximations:

$$x_2 = 1.039 + \frac{0.012 \cdot 0.0282}{0.0595} \approx 1.04469, \quad \bar{x}_2 = 1.051 - \frac{0.0313}{5.1005} \approx 1.04487$$

Here, $\bar{x}_2 - x_2 = 0.00018$, which shows the desired degree of accuracy
has been achieved. We can put

$$\xi = \frac{1}{2} (1.04469 + 1.04487) = 1.04478 \approx 1.045$$

with absolute error less than

$$\frac{1}{2} \cdot 0.00018 + 0.00022 = 0.00031 < \frac{1}{2} \cdot 10^{-3}$$
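The two-sided process of this example can be sketched as follows, assuming the first case ($f' > 0$, $f'' > 0$) so that the chord step moves the lower approximation and the Newton step moves the upper one. The function name `combination_method` is hypothetical.

```python
def combination_method(f, df, a, b, eps, max_iter=50):
    """Combination of chords and tangents for f' > 0, f'' > 0 on [a, b].
    Returns the pair (minor, major) of approximations with major - minor < eps."""
    x_lo, x_hi = a, b
    for _ in range(max_iter):
        if x_hi - x_lo < eps:
            break
        f_lo, f_hi = f(x_lo), f(x_hi)
        # Chord step on [x_lo, x_hi]: formula (1), uses the old x_hi
        x_lo = x_lo - f_lo * (x_hi - x_lo) / (f_hi - f_lo)
        # Tangent (Newton) step at the upper end: formula (1')
        x_hi = x_hi - f_hi / df(x_hi)
    return x_lo, x_hi


def f(x):
    return x**5 - x - 0.2


def df(x):
    return 5 * x**4 - 1


lo, hi = combination_method(f, df, 1.0, 1.1, eps=0.0005)
root = 0.5 * (lo + hi)             # arithmetic mean of the last pair
print(f"{lo:.5f} < xi < {hi:.5f}, mean = {root:.5f}")
```

Because the two approximations bracket the root at every step, the difference `hi - lo` is a rigorous error bound, which is the main attraction of the method.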


4.8 THE METHOD OF ITERATION 


One of the most important methods in the numerical solution
of equations is the method of iteration (often also called the method
of successive approximations). Essentially, this method consists in
the following. Suppose we have an equation

$$f(x) = 0 \tag{1}$$

where $f(x)$ is a continuous function and it is required to determine
its real roots. Replace (1) with an equivalent equation,

$$x = \varphi(x) \tag{2}$$

In some way choose a roughly approximate value of the root,
$x_0$, and substitute it into the right member of (2) to get a number

$$x_1 = \varphi(x_0) \tag{3}$$

Now inserting $x_1$ in the right member of (3) in place of $x_0$ we
get a new number $x_2 = \varphi(x_1)$. Repeating this process, we get a
sequence of numbers

$$x_n = \varphi(x_{n-1}) \quad (n = 1, 2, \ldots) \tag{4}$$

If this sequence is convergent, that is, if there exists a limit
$\xi = \lim_{n \to \infty} x_n$, then, by passing to the limit in (4) and assuming the
function $\varphi(x)$ to be continuous, we find

$$\lim_{n \to \infty} x_n = \varphi\left(\lim_{n \to \infty} x_{n-1}\right)$$

or

$$\xi = \varphi(\xi) \tag{5}$$

Thus, the limit $\xi$ is a root of (2) and can be computed by
formula (4) to any desired degree of accuracy.
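Process (4) translates directly into a loop. As an illustration, the sketch below rewrites the equation $x^5 - x - 0.2 = 0$ of Sec. 4.7 in the form $x = \varphi(x)$ with $\varphi(x) = (x + 0.2)^{1/5}$; this particular rearrangement, the function name `iterate`, and the tolerance are assumptions made here.

```python
def iterate(phi, x0, eps=1e-10, max_iter=1000):
    """Method of iteration: x_n = phi(x_{n-1}) until two successive
    approximations agree to within eps."""
    x = x0
    for _ in range(max_iter):
        x_new = phi(x)
        if abs(x_new - x) < eps:
            return x_new
        x = x_new
    raise RuntimeError("iteration did not converge")


def phi(x):
    # x**5 - x - 0.2 = 0 rewritten as x = (x + 0.2)**(1/5)
    return (x + 0.2) ** 0.2


root = iterate(phi, 1.0)
print(f"root = {root:.6f}")
```

The other obvious rearrangement, $x = x^5 - 0.2$, has $|\varphi'(x)| = 5x^4 > 1$ near the root and would not converge, which shows why the choice of $\varphi$ matters.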
Geometrically, the method of iteration can be explained as follows.
Plot on an xy-plane the graphs of the functions $y = x$ and
$y = \varphi(x)$. Each real root $\xi$ of equation (2) is an abscissa of the
point of intersection M of the curve $y = \varphi(x)$ with the straight
line $y = x$ (Fig. 29).

Fig. 29

Starting from a point $A_0[x_0, \varphi(x_0)]$, we construct the polygonal
line $A_0 B_1 A_1 B_2 A_2 \ldots$ ("staircase"), the segments of which are
alternately parallel to the x-axis and to the y-axis, the vertices
$A_0, A_1, A_2, \ldots$ lie on the curve $y = \varphi(x)$, and the vertices
$B_1, B_2, B_3, \ldots$ lie on the straight line $y = x$. The common abscissas
of the points $A_1$ and $B_1$, $A_2$ and $B_2$, ..., will obviously be,
respectively, the successive approximations $x_1, x_2, \ldots$ to the root $\xi$.

It is also possible (Fig. 30) to have a different kind of polygonal
line $A_0 B_1 A_1 B_2 A_2 \ldots$ ("spiral"). It is readily seen that the "staircase"
solution is obtained if the derivative $\varphi'(x)$ is positive and the
"spiral" solution if $\varphi'(x)$ is negative.

Fig. 30

In Fig. 29, the curve $y = \varphi(x)$ slopes gently in the neighbourhood of
the root $\xi$, that is, $|\varphi'(x)| < 1$ and the process of iteration
converges. However, if we consider the case where $|\varphi'(x)| > 1$, then the
process of iteration may be divergent (Fig. 31). Hence, to apply the
method of iteration practically, we have to ascertain the sufficient
conditions for the convergence of the iteration process.

Fig. 31


Theorem 1. Let a function $\varphi(x)$ be defined and differentiable on an
interval $[a, b]$ with all values $\varphi(x) \in [a, b]$.

Then if there exists a proper fraction $q$ such that

$|\varphi'(x)| \le q < 1$  (6)

for $a < x < b$, then: (1) the process of iteration

$x_n = \varphi(x_{n-1})$  $(n = 1, 2, \ldots)$  (7)

converges irrespective of the initial value $x_0 \in [a, b]$; (2) the limiting
value

$\xi = \lim_{n \to \infty} x_n$

is the only root of the equation

$x = \varphi(x)$  (8)

on the interval $[a, b]$.
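The statement of Theorem 1 can be tried out numerically. The following is a minimal Python sketch of process (7), assuming the illustrative choice $\varphi(x) = \cos x$ on $[0, 1]$ (not taken from the text): $\cos x$ maps $[0, 1]$ into itself and $|\varphi'(x)| = |\sin x| \le \sin 1 < 1$ there, so the conditions of the theorem hold. The function name `iterate` and the number of steps are hypothetical.

```python
import math

def iterate(phi, x0, steps):
    """Apply x_n = phi(x_{n-1}) a fixed number of times and return all iterates."""
    xs = [x0]
    for _ in range(steps):
        xs.append(phi(xs[-1]))
    return xs

# Illustrative choice (an assumption, not from the text):
# phi(x) = cos x maps [0, 1] into itself and |phi'(x)| <= sin 1 < 1 there.
xs_from_0 = iterate(math.cos, 0.0, 100)
xs_from_1 = iterate(math.cos, 1.0, 100)

# Both starting values should lead to the same root of x = cos x.
print(xs_from_0[-1], xs_from_1[-1])
```

Starting from either end of the interval gives the same limiting value, which is what assertion (1) of the theorem leads one to expect.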
Proof. We consider two successive approximations
$x_n = \varphi(x_{n-1})$ and $x_{n+1} = \varphi(x_n)$
(which are definitely meaningful by virtue of the conditions of the
theorem), whence
$x_{n+1} - x_n = \varphi(x_n) - \varphi(x_{n-1})$
Applying the mean-value theorem, we have

$x_{n+1} - x_n = (x_n - x_{n-1})\,\varphi'(\bar{x}_n)$


1) For the number $q$ we can take the greatest value or an upper bound of the
modulus of the derivative $|\varphi'(x)|$ for $a < x < b$.




where $\bar{x}_n \in (x_{n-1}, x_n)$. And so, on the basis of (6), we get

$|x_{n+1} - x_n| \le q\,|x_n - x_{n-1}|$  (9)

From this, assigning the values $n = 1, 2, 3, \ldots$ in succession, we
derive

$|x_2 - x_1| \le q\,|x_1 - x_0|$,
$|x_3 - x_2| \le q^2\,|x_1 - x_0|$,
. . . . . . . . . . . . . . .
$|x_{n+1} - x_n| \le q^n\,|x_1 - x_0|$  (10)

Let us consider the series

$x_0 + (x_1 - x_0) + (x_2 - x_1) + \ldots + (x_n - x_{n-1}) + \ldots$  (11)

for which our successive approximations $x_n$ are $(n+1)$th partial
sums, that is

$x_n = S_{n+1}$

By the inequality (10), the terms of the series (11) are less in absolute
value than the corresponding terms of a geometric progression with
ratio $q < 1$, and so the series (11) converges and converges absolutely.
Hence

$\lim_{n \to \infty} S_{n+1} = \lim_{n \to \infty} x_n = \xi$

and obviously $\xi \in [a, b]$.


Passing to the limit in (7), we obtain, by virtue of the continuity
of the function $\varphi(x)$,

$\xi = \varphi(\xi)$  (12)

Thus, $\xi$ is a root of equation (8), which does not have any other
root on the interval $[a, b]$. Indeed, if

$\bar{\xi} = \varphi(\bar{\xi})$  (13)

then from (12) and (13) we obtain

$\bar{\xi} - \xi = \varphi(\bar{\xi}) - \varphi(\xi)$

and, consequently,

$(\bar{\xi} - \xi)\,[1 - \varphi'(c)] = 0$  (14)

where $c \in [\xi, \bar{\xi}]$. Since the expression in square brackets in (14) is
not zero, it follows that $\bar{\xi} = \xi$, that is, the root $\xi$ is the only one.


Note 1. The theorem remains valid if the function $\varphi(x)$ is defined
and differentiable in the infinite interval $-\infty < x < +\infty$, and
inequality (6) holds true when $x \in (-\infty, +\infty)$.




Note 2. Under the conditions of Theorem 1, the method of iteration
converges for any choice of the initial value $x_0$ in $[a, b]$. For this
reason it is self-correcting, that is, an individual error in the
computations that does not go beyond the limits of the interval $[a, b]$
will not affect the final result since an erroneous value may be
regarded as a new initial value $x_0$. Only the amount of work may
increase. The property of self-correction makes the method of iteration
one of the most reliable computational methods. Quite naturally,
systematic errors in applying this method can prevent one from
obtaining the result required.


Estimate of an approximation. From formula (10) we have

$|x_{n+p} - x_n| \le |x_{n+p} - x_{n+p-1}| + |x_{n+p-1} - x_{n+p-2}| + \ldots + |x_{n+1} - x_n|$
$\le q^{n+p-1}|x_1 - x_0| + q^{n+p-2}|x_1 - x_0| + \ldots + q^n|x_1 - x_0|$
$= q^n|x_1 - x_0|\,(1 + q + q^2 + \ldots + q^{p-1})$

Summing the geometric progression, we obtain

$|x_{n+p} - x_n| \le q^n|x_1 - x_0|\,\frac{1 - q^p}{1 - q} < \frac{q^n}{1 - q}\,|x_1 - x_0|$

Allowing the number $p$ to go to infinity and taking into account
that $\lim_{p \to \infty} x_{n+p} = \xi$, we finally get

$|\xi - x_n| \le \frac{q^n}{1 - q}\,|x_1 - x_0|$  (15)

From this it is clear that the convergence of the process of iteration
will be the faster, the smaller the number $q$.
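Estimate (15) can be used before any iterating is done to judge how many steps will suffice. A small Python sketch under stated assumptions: the function names and the numerical values ($q = 0.5$, $|x_1 - x_0| = 0.1$, target accuracy $10^{-6}$) are hypothetical and serve only to illustrate the bound.

```python
def apriori_bound(q, first_step, n):
    """Right-hand side of estimate (15): q**n / (1 - q) * |x_1 - x_0|."""
    return q**n / (1.0 - q) * abs(first_step)

def steps_needed(q, first_step, eps):
    """Smallest n for which the bound (15) does not exceed eps (requires 0 <= q < 1)."""
    n = 0
    while apriori_bound(q, first_step, n) > eps:
        n += 1
    return n

# Hypothetical data: q = 0.5, |x_1 - x_0| = 0.1, required accuracy 1e-6.
n = steps_needed(0.5, 0.1, 1e-6)
print(n, apriori_bound(0.5, 0.1, n))
```

The bound is usually pessimistic, since $q$ is a worst-case value over the whole interval.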

Approximations can be estimated by another formula, which
finds application in certain cases. Let

$f(x) = x - \varphi(x)$

It is obvious that $f'(x) = 1 - \varphi'(x) \ge 1 - q$, whence, noting that
$f(\xi) = 0$, we get

$|x_n - \varphi(x_n)| = |f(x_n) - f(\xi)| = |x_n - \xi|\,|f'(\bar{x}_n)| \ge (1 - q)\,|x_n - \xi|$

where $\bar{x}_n \in (x_n, \xi)$ and, hence,

$|x_n - \xi| \le \frac{|x_n - \varphi(x_n)|}{1 - q}$  (16)

that is

$|\xi - x_n| \le \frac{|x_{n+1} - x_n|}{1 - q}$  (16')

Using formula (9) we also get

$|\xi - x_n| \le \frac{q}{1 - q}\,|x_n - x_{n-1}|$  (16'')

whence it follows, in particular, that if $q \le \frac{1}{2}$, then

$|\xi - x_n| \le |x_n - x_{n-1}|$

In this case, from the inequality $|x_n - x_{n-1}| < \varepsilon$ follows the
inequality

$|\xi - x_n| < \varepsilon$


Note. There is a widespread opinion that if when using the method
of iteration two successive approximations $x_{n-1}$ and $x_n$ coincide to
within the specified accuracy $\varepsilon$ (for instance, the first $m$ decimals
are stabilized in these approximations), then the equality $\xi \approx x_n$
holds true with the same accuracy (that is, in particular, in the given
example the approximate number $x_n$ is correct to $m$ places!). As Fig. 32
so vividly reveals, this assertion is erroneous in the general case. What
is more, it is easy to demonstrate that if $\varphi'(x)$ is close to unity,
then the quantity $|\xi - x_n|$ may be large, although the quantity
$|x_n - x_{n-1}|$ is extremely small.

Fig. 32

Formula (16'') enables one to estimate the error in the approximate
value $x_n$ from the discrepancy between two successive approximations
$x_{n-1}$ and $x_n$.

The process of iteration should be continued until the inequality

$|x_n - x_{n-1}| \le \frac{1 - q}{q}\,\varepsilon$

holds true for the two successive approximations $x_{n-1}$ and $x_n$; here,
$\varepsilon$ is the specified limiting absolute error of the root $\xi$ and
$|\varphi'(x)| \le q$. Then, by virtue of formula (16''), the inequality

$|\xi - x_n| \le \varepsilon$

is valid; that is,

$\xi = x_n \pm \varepsilon$
Note that if

$x_n = \varphi(x_{n-1})$

and

$\xi = \varphi(\xi)$

then

$|\xi - x_n| = |\varphi(\xi) - \varphi(x_{n-1})| = |\xi - x_{n-1}|\,|\varphi'(\bar{x}_{n-1})| \le q\,|\xi - x_{n-1}|$  $[\bar{x}_{n-1} \in (x_{n-1}, \xi)]$

that is,

$|\xi - x_n| < |\xi - x_{n-1}|$

Thus, in a convergent iterative process the error $|\xi - x_n|$ tends to
zero monotonically, which is to say, each successive value $x_n$ is
more exact than the preceding value $x_{n-1}$. In all these conclusions
we of course ignore rounding errors; it is assumed that the successive
approximations are found exactly.

Ordinarily, in practical situations a crude technique is used to
establish the existence of a root $\xi$ of equation (2) and then by the
method of iteration it is required to obtain a sufficiently exact
approximate value of the root; inequality (6) is valid only within
a certain neighbourhood $(a, b)$ of this root. If the choice of the
initial value $x_0$ here is inept, the successive approximations
$x_n = \varphi(x_{n-1})$ $(n = 1, 2, \ldots)$ can leave the interval $(a, b)$ or even
become meaningless. It is therefore useful to have an alternative
statement of Theorem 1.


Theorem 2. Let a function $\varphi(x)$ be defined and differentiable on some
interval $[a, b]$, and let the equation

$x = \varphi(x)$  (17)

have a root $\xi$ located in a smaller interval $[\alpha, \beta]$, where
$\alpha = a + \frac{1}{3}(b - a)$ and $\beta = b - \frac{1}{3}(b - a)$ (Fig. 33).

Fig. 33

In this case if (a) $|\varphi'(x)| \le q < 1$ for $a < x < b$ and (b) the
initial approximation $x_0 \in [\alpha, \beta]$, then:
(1) all successive approximations lie in the interval $(a, b)$:
$x_n = \varphi(x_{n-1}) \in (a, b)$  $(n = 1, 2, \ldots)$

(2) the process of successive approximations is convergent; that is,
there exists
$\lim_{n \to \infty} x_n = \xi$
and $\xi$ is the only root on the interval $[a, b]$ of equation (17), and
(3) estimate (15) is valid.




Proof. (1) Indeed, suppose

$x_0 \in [\alpha, \beta]$

Then the equation

$x_1 = \varphi(x_0)$

is obviously meaningful. Utilizing the equation

$\xi = \varphi(\xi)$

we get, on the basis of the mean-value theorem,

$|x_1 - \xi| = |\varphi(x_0) - \varphi(\xi)| = |x_0 - \xi|\,|\varphi'(\bar{x}_0)| \le q\,(\beta - \alpha) < \frac{b - a}{3}$

whence

$x_1 \in (a, b)$

Generally, if $x_{n-1} \in (a, b)$ $(n = 1, 2, \ldots)$ and $|x_{n-1} - \xi| < \frac{b - a}{3}$,
then

$x_n = \varphi(x_{n-1})$

is meaningful and

$|x_n - \xi| = |\varphi(x_{n-1}) - \varphi(\xi)| = |x_{n-1} - \xi|\,|\varphi'(\bar{x}_{n-1})| \le q\,|x_{n-1} - \xi| < \frac{b - a}{3}$

Consequently, $x_n \in (a, b)$ where $n = 1, 2, 3, \ldots$.
The proofs of assertions (2) and (3) are completely analogous to
the proof of Theorem 1.


Note. Suppose that in a certain neighbourhood $(a, b)$ of the
root $\xi$ of equation (17), the derivative $\varphi'(x)$ preserves sign and the
inequality

$|\varphi'(x)| \le q < 1$

is valid.

Then if the derivative $\varphi'(x)$ is positive, the successive
approximations

$x_n = \varphi(x_{n-1})$  $(n = 1, 2, \ldots)$,  $x_0 \in (a, b)$

converge to the root $\xi$ monotonically.
However, if the derivative $\varphi'(x)$ is negative, then the successive
approximations oscillate about the root $\xi$.


(1) Indeed, let $0 < \varphi'(x) \le q < 1$ and, say,

$x_0 < \xi$

Then

$x_1 - \xi = \varphi(x_0) - \varphi(\xi) = (x_0 - \xi)\,\varphi'(\xi_1) < 0$

where $\xi_1 \in (x_0, \xi)$, and

$|x_1 - \xi| < |x_0 - \xi|$

Consequently

$x_0 < x_1 < \xi$

Using the method of mathematical induction, we obtain

$x_0 < x_1 < x_2 < \ldots < \xi$

(Fig. 34a).

Fig. 34a

A similar result is obtained when $x_0 > \xi$.

Thus, if the derivative $\varphi'(x)$ is positive, it suffices only to
choose the initial approximation $x_0$ belonging to the neighbourhood
$(a, b)$ of the root $\xi$ that interests us; all the remaining
approximations $x_n$ $(n = 1, 2, \ldots)$ will automatically lie in this
neighbourhood and will monotonically approach the root $\xi$ as $n$
increases.


(2) Let $-1 < -q \le \varphi'(x) < 0$ and, say, $x_0 < \xi$; $x_1 = \varphi(x_0) \in (a, b)$.
We have

$x_1 - \xi = \varphi(x_0) - \varphi(\xi) = (x_0 - \xi)\,\varphi'(\xi_1) > 0$

that is, $x_1 > \xi$ and $|x_1 - \xi| < |x_0 - \xi|$.

Repeating these arguments for the approximations $x_1, x_2, \ldots$,
we get

$x_0 < x_2 < \ldots < \xi < \ldots < x_3 < x_1$

Thus, the successive approximations oscillate about the root $\xi$
(Fig. 34b).

Fig. 34b

Thus, in the case of a negative derivative $\varphi'(x)$, if two
approximations $x_0$ and $x_1$ belong to the neighbourhood $(a, b)$ of the
root $\xi$, all the other approximations $x_n$ $(n = 2, 3, \ldots)$ also belong
to this neighbourhood; the sequence $\{x_n\}$ “strangles” the root $\xi$.

Note that, obviously,

$|\xi - x_n| \le |x_n - x_{n-1}|$

that is, in this case the stabilized digits of the approximation $x_n$
definitely belong to the exact root $\xi$.


Example 1. Find the real roots of the equation x—sinx=0.25 to 
three significant digits. 




Solution. Write the equation as

$x = \sin x + 0.25$

We establish graphically that the equation has one real root $\xi$
approximately equal to $x_0 = 1.2$ in the interval $[1.1, 1.3]$ (Fig. 35).

Fig. 35

Using the notation of Theorem 2, we put

$\alpha = 1.1$ and $\beta = 1.3$

whence

$a = \alpha - (\beta - \alpha) = 0.9 = \operatorname{arc} 52^\circ$

and

$b = \beta + (\beta - \alpha) = 1.5 = \operatorname{arc} 86^\circ$

Since

$\varphi(x) = \sin x + 0.25$

and

$\varphi'(x) = \cos x$

we have, for $0.9 < x < 1.5$,

$|\varphi'(x)| \le \cos 52^\circ \approx 0.62 = q$

If we choose $x_0 \in (1.1, 1.3)$, then all the conditions of Theorem 2
will be obeyed and, hence, we will be assured that the successive
approximations

$x_n = \sin x_{n-1} + 0.25$  $(n = 1, 2, \ldots)$

(1) lie in the interval $(0.9, 1.5)$ and (2) $x_n \to \xi$ as $n \to \infty$.
Choosing $x_0 = 1.2$ and specifying, according to the requirements
of the problem, the limiting absolute error

$\varepsilon = \frac{1}{2} \cdot 10^{-2}$

we construct the successive approximations $x_n$ $(n = 1, 2, \ldots)$ until
two adjacent approximations $x_{n-1}$ and $x_n$ coincide to within the
limits of error equal to

$\frac{1 - q}{q}\,\varepsilon \approx 0.5 \cdot \frac{1}{2} \cdot 10^{-2} = 0.0025$

We have

$x_1 = \sin 1.2 + 0.25 = 0.932 + 0.25 = 1.182$,
$x_2 = \sin 1.182 + 0.25 = 0.925 + 0.25 = 1.175$,
$x_3 = \sin 1.175 + 0.25 = 0.923 + 0.25 = 1.173$,
$x_4 = \sin 1.173 + 0.25 = 0.922 + 0.25 = 1.172$,
$x_5 = \sin 1.172 + 0.25 = 0.922 + 0.25 = 1.172$

The fourth and fifth approximations coincide to within four
significant digits. Therefore [see (16'')]

$|\xi - x_5| \le \frac{0.62 \cdot 0.001}{1 - 0.62} \approx 0.0016$

Since the limiting absolute error of the approximate root $x_5$ (including
the rounding error) does not exceed

$0.0016 + 0.002 < 4 \cdot 10^{-3}$

we can take it that

$\xi = 1.17 \pm 0.005$
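The computation of Example 1 can be retraced in a few lines of Python. This is only a sketch: the helper `iterate_until` and its parameters are hypothetical names, and it applies the stopping rule $|x_n - x_{n-1}| \le \frac{1-q}{q}\varepsilon$ with the unrounded factor $(1-q)/q$ and full floating-point arithmetic, so the number of steps need not coincide with the hand computation above, which rounds the factor down to 0.5 and carries three decimals.

```python
import math

def iterate_until(phi, x0, q, eps, max_steps=1000):
    """Iterate x_n = phi(x_{n-1}) until |x_n - x_{n-1}| <= (1 - q) / q * eps.

    By estimate (16''), this guarantees |xi - x_n| <= eps
    provided |phi'(x)| <= q < 1 on the interval in question.
    """
    tol = (1.0 - q) / q * eps
    x_prev = x0
    for n in range(1, max_steps + 1):
        x = phi(x_prev)
        if abs(x - x_prev) <= tol:
            return x, n
        x_prev = x
    raise RuntimeError("no convergence within max_steps")

# Data of Example 1: phi(x) = sin x + 0.25, x_0 = 1.2, q = 0.62, eps = 0.5e-2.
root, steps = iterate_until(lambda x: math.sin(x) + 0.25, 1.2, 0.62, 0.5e-2)
print(round(root, 3), steps)
```

The returned value differs from the exact root by less than the prescribed $\varepsilon$, in agreement with the result $\xi = 1.17 \pm 0.005$.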
Note. The given equation

$f(x) = 0$  (18)

may be written as

$x = \varphi(x)$  (18')

choosing the function $\varphi(x)$ in different ways.

The notation of (18') is by no means immaterial; in some cases
$|\varphi'(x)|$ will prove to be small in the neighbourhood of the desired
root $\xi$, in others it will be large. For the method of iteration,
the most advantageous representation of (18') is that in which the
inequality

$|\varphi'(x)| \le q < 1$  (19)

is valid, and the smaller the number $q$, the faster, generally
speaking, will the successive approximations converge to the root $\xi$.
We give here one rather general technique for reducing equation (18)
to the form (18'), for which the validity of inequality (19) is
ensured. Let the desired root $\xi$ of the equation lie in the
interval $[a, b]$, and

$0 < m_1 \le f'(x) \le M_1$  (20)

for $a \le x \le b$. 1) In particular, we can take for $m_1$ the smallest
value of the derivative $f'(x)$ on the interval $[a, b]$, which value
must be positive, and for $M_1$ the greatest value of $f'(x)$ on the
interval $[a, b]$. Replace (18) by the equivalent equation

$x = x - \lambda f(x)$  $(\lambda > 0)$

We can set $\varphi(x) = x - \lambda f(x)$.
We choose the parameter $\lambda$ in such a way that in the given
neighbourhood $[a, b]$ of the root $\xi$ the inequality

$0 \le \varphi'(x) = 1 - \lambda f'(x) \le q < 1$  (21)

is valid, whence, on the basis of expression (20), we get

$0 \le 1 - \lambda M_1 \le 1 - \lambda m_1 \le q$

Consequently, we can choose

$\lambda = \frac{1}{M_1}$

and

$q = 1 - \frac{m_1}{M_1} < 1$

Inequality (21) is thus valid.

1) If the derivative $f'(x)$ is negative, then we consider the equation $-f(x) = 0$
instead of $f(x) = 0$.
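The reduction with $\lambda = 1/M_1$ is easy to try out. The sketch below assumes, purely for illustration, the cubic $f(x) = x^3 - x - 1$ on $[1, 2]$ (the equation that the text takes up later in Example 4), for which $f'(x) = 3x^2 - 1$ gives $m_1 = 2$ and $M_1 = 11$; the helper name `make_phi` is hypothetical.

```python
def make_phi(f, M1):
    """Return phi(x) = x - f(x) / M1, that is, lambda = 1 / M1 in x = x - lambda f(x)."""
    lam = 1.0 / M1
    return lambda x: x - lam * f(x)

# Illustrative equation (an assumption for this sketch): f(x) = x**3 - x - 1 on [1, 2],
# where f'(x) = 3x**2 - 1 satisfies m1 = 2 <= f'(x) <= M1 = 11.
f = lambda x: x**3 - x - 1
m1, M1 = 2.0, 11.0
q = 1.0 - m1 / M1          # the bound on phi'(x) that follows from (21)
phi = make_phi(f, M1)

x = 1.5
for _ in range(200):
    x = phi(x)
print(x, q)
```

Here $q = 9/11$ is fairly close to unity, so this general-purpose choice of $\varphi$ converges, but not quickly.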
Example 2. Find the largest positive root $\xi$ of the equation

$x^3 + x = 1000$  (22)

to within $10^{-4}$.
Solution. A rough guess gives us the approximate value of the
root $x_0 = 10$; it is obvious that $\xi < x_0$.
Equation (22) may be written in the form

$x = 1000 - x^3$  (22')

or

$x = \frac{1000}{x^2} - \frac{1}{x}$  (22'')

or

$x = \sqrt[3]{1000 - x}$  (22''')

and so forth. The most suitable version is (22''') because by taking
the interval $[9, 10]$ for the main interval and setting

$\varphi(x) = \sqrt[3]{1000 - x}$

we have

$\varphi'(x) = \frac{-1}{3\sqrt[3]{(1000 - x)^2}}$

whence

$|\varphi'(x)| \le \frac{1}{3\sqrt[3]{990^2}} \approx \frac{1}{300} = q$

Compute the successive approximations $x_n$ with one extra digit
using the formulas

$y_n = 1000 - x_n$,
$x_{n+1} = \sqrt[3]{y_n}$  $(n = 0, 1, 2, \ldots)$

The values thus found are listed in Table 4.

TABLE 4

VALUES OF THE SUCCESSIVE APPROXIMATIONS $x_n$ AND $y_n$

n    x_n        y_n
0    10         990
1    9.96655    990.03345
2    9.96666    990.03334
3    9.96667

Since $1 - q \approx 1$, we can put $\xi = 9.9667$ to within an accuracy
of $10^{-4}$.
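The entries of Table 4 can be regenerated with a short Python sketch. The function name `cbrt_iteration` is hypothetical, and the values are computed in floating point, so the last printed digit may differ from a hand computation carried out with five decimals.

```python
def cbrt_iteration(x0, steps):
    """Iterate y_n = 1000 - x_n, x_{n+1} = y_n ** (1/3), i.e. form (22''') of x**3 + x = 1000."""
    rows = []
    x = x0
    for n in range(steps + 1):
        y = 1000.0 - x
        rows.append((n, x, y))
        x = y ** (1.0 / 3.0)
    return rows

rows = cbrt_iteration(10.0, 3)
for n, x, y in rows:
    print(f"{n}  {x:.5f}  {y:.5f}")
```

Because $q \approx 1/300$, each step gains more than two decimal places, which is why three steps suffice.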

The method of iteration can also be used to compute the roots
of equations given in the form of power series.


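Example 3. Find the real root of the equation [2]

$x - \frac{x^3}{3} + \frac{x^5}{10} - \frac{x^7}{42} + \frac{x^9}{216} - \frac{x^{11}}{1320} + \ldots + (-1)^{n+1}\frac{x^{2n-1}}{(n-1)!\,(2n-1)} + \ldots = 0.4431135$

Solution. We have $x = \varphi(x)$, where

$\varphi(x) = 0.4431135 + \frac{x^3}{3} - \frac{x^5}{10} + \frac{x^7}{42} - \frac{x^9}{216} + \frac{x^{11}}{1320} - \ldots$

Neglecting all powers of $x$ above the first, we determine an
approximate value of the root to be 0.44. Then

$x_1 = \varphi(0.44) \approx 0.47$,
$x_2 = \varphi(0.47) \approx 0.476$,
$x_3 = \varphi(0.476) \approx 0.4767$,
$x_4 = \varphi(0.4767) \approx 0.47689$,
$x_5 = \varphi(0.47689) \approx 0.476927$,
$x_6 = \varphi(0.476927) \approx 0.476934$,
$x_7 = \varphi(0.476934) \approx 0.476936$

Hence, $\xi = 0.47693$.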
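The iteration of Example 3 can be sketched in Python by summing the series term by term. The truncation after 20 terms and the names `C` and `phi` are assumptions made for this sketch; for $|x| < 1$ the omitted terms are far below the accuracy asked for.

```python
import math

C = 0.4431135  # right-hand side of the equation in Example 3

def phi(x, terms=20):
    """phi(x) = C + x^3/3 - x^5/10 + x^7/42 - ..., truncated after `terms` terms.

    The k-th term (k = 1, 2, ...) is (-1)^(k+1) * x^(2k+1) / (k! * (2k+1)).
    """
    total = C
    for k in range(1, terms + 1):
        total += (-1) ** (k + 1) * x ** (2 * k + 1) / (math.factorial(k) * (2 * k + 1))
    return total

x = 0.44
for _ in range(30):
    x = phi(x)
print(round(x, 5))
```

Thirty steps are far more than needed here; the loop length is chosen only so that the printed value no longer changes.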
We give another technique for accelerating the convergence of
the process of iteration which may be useful in certain cases [7].
Suppose we have an equation

$x = \varphi(x)$

such that the inequality

$|\varphi'(x)| \ge k > 1$

holds within the neighbourhood of the desired root $\xi$. Then the
process of iteration will diverge for this equation. But if the
given equation is replaced by the equivalent equation

$x = \psi(x)$

where $\psi(x) = \varphi^{-1}(x)$ is the inverse function, we get an equation
for which the process of iteration converges since

$|\psi'(x)| = \frac{1}{|\varphi'(\psi(x))|} \le \frac{1}{k} = q < 1$


Example 4. The equation

$f(x) \equiv x^3 - x - 1 = 0$  (23)

has a root $\xi \in (1, 2)$ since $f(1) = -1 < 0$ and $f(2) = 5 > 0$.
Equation (23) may be written as

$x = x^3 - 1$  (24)

Here

$\varphi(x) = x^3 - 1$ and $\varphi'(x) = 3x^2$

and so

$\varphi'(x) \ge 3$ for $1 \le x \le 2$

and, hence, the conditions of convergence of the process of iteration
are not fulfilled.
If we write (23) as

$x = \sqrt[3]{x + 1}$  (25)

we will have

$\psi(x) = \sqrt[3]{x + 1}$ and $\psi'(x) = \frac{1}{3\sqrt[3]{(x + 1)^2}}$

Whence $0 < \psi'(x) < \frac{1}{3\sqrt[3]{4}} < \frac{1}{4}$ for $1 \le x \le 2$ and, hence, for
equation (25) the process of iteration will converge rapidly.
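The contrast between forms (24) and (25) shows up after only a few steps. A minimal Python sketch, in which the helper `run` and the starting value 1.5 are assumptions made for illustration:

```python
def run(phi, x0, steps):
    """Return the iterates x_0, x_1, ..., x_steps of x_n = phi(x_{n-1})."""
    xs = [x0]
    for _ in range(steps):
        xs.append(phi(xs[-1]))
    return xs

diverging = run(lambda x: x**3 - 1, 1.5, 4)                   # form (24)
converging = run(lambda x: (x + 1) ** (1.0 / 3.0), 1.5, 30)   # form (25)

print(diverging)        # the iterates leave the interval (1, 2) at once
print(converging[-1])   # the iterates settle on the root in (1, 2)
```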




4.9 THE METHOD OF ITERATION FOR A SYSTEM 
OF TWO EQUATIONS 


Let there be given two equations in two unknowns: 


$F_1(x, y) = 0$,
$F_2(x, y) = 0$  (1)

whose real roots it is required to find to within a specified degree 
of accuracy. 

We assume that the system (1) allows only for isolated roots. 
We can establish the number of these roots and give their crudely 
approximate values by constructing the curves F,(x, y)=0, 
F,(x, y)=0 and determining the coordinates of their points of 
intersection. 

Let $x = x_0$, $y = y_0$ be the approximate values of the roots of
system (1) obtained graphically or in some other way (say, by a
rough guess).

We offer an iteration process which, under certain circumstances,
permits improving the given approximate values of the roots.
To do this, represent (1) as

$x = \varphi_1(x, y)$,
$y = \varphi_2(x, y)$  (2)

and construct the successive approximations according to the
following formulas:

$x_1 = \varphi_1(x_0, y_0)$,  $y_1 = \varphi_2(x_0, y_0)$,
$x_2 = \varphi_1(x_1, y_1)$,  $y_2 = \varphi_2(x_1, y_1)$,
. . . . . . . . . . . . . . . . . . . . . . . .
$x_{n+1} = \varphi_1(x_n, y_n)$,  $y_{n+1} = \varphi_2(x_n, y_n)$  (3)


If the iteration process (3) converges, that is, if there exist the
limits

$\xi = \lim_{n \to \infty} x_n$ and $\eta = \lim_{n \to \infty} y_n$

then, assuming the functions $\varphi_1(x, y)$ and $\varphi_2(x, y)$ to be
continuous and passing to the limit in the general-form equation (3),
we get

$\lim_{n \to \infty} x_{n+1} = \lim_{n \to \infty} \varphi_1(x_n, y_n)$,
$\lim_{n \to \infty} y_{n+1} = \lim_{n \to \infty} \varphi_2(x_n, y_n)$

Whence

$\xi = \varphi_1(\xi, \eta)$,  $\eta = \varphi_2(\xi, \eta)$

that is, the limiting values $\xi$ and $\eta$ are roots of system (2) and,
hence, of system (1) as well. Therefore, taking a sufficiently large
number of iterations (3), we get the numbers $x_n$ and $y_n$ which
will differ from the exact roots $x = \xi$ and $y = \eta$ of (1) by an
arbitrarily small value. The problem is thus solved. If the iteration
process (3) diverges, it cannot be used.


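Theorem. In a closed neighbourhood $R\,\{a \le x \le A;\ b \le y \le B\}$
(Fig. 36) let there be one and only one pair of roots $x = \xi$ and $y = \eta$ of
system (2). If: (1) the functions $\varphi_1(x, y)$ and $\varphi_2(x, y)$ are defined
and continuously differentiable in $R$; (2) the initial approximations
$x_0, y_0$ and all succeeding approximations $x_n, y_n$ $(n = 1, 2, \ldots)$ belong
to $R$; (3) the following inequalities are valid in $R$:

$\left|\frac{\partial \varphi_1}{\partial x}\right| + \left|\frac{\partial \varphi_2}{\partial x}\right| \le q_1 < 1$,
$\left|\frac{\partial \varphi_1}{\partial y}\right| + \left|\frac{\partial \varphi_2}{\partial y}\right| \le q_2 < 1$

then the process of successive approximations (3) converges to the
roots $x = \xi$ and $y = \eta$ of system (2), that is,

$\lim_{n \to \infty} x_n = \xi$ and $\lim_{n \to \infty} y_n = \eta$

Fig. 36

Note. The theorem holds true if Condition (3) is replaced by
Condition (3'):

$\left|\frac{\partial \varphi_1}{\partial x}\right| + \left|\frac{\partial \varphi_1}{\partial y}\right| \le q_1 < 1$,
$\left|\frac{\partial \varphi_2}{\partial x}\right| + \left|\frac{\partial \varphi_2}{\partial y}\right| \le q_2 < 1$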












A rough proof of this theorem is given in [2]. A more general 
theorem is given in Secs. 13.10 and 13.11. 


Example. For the system [2]

$f_1(x, y) \equiv 2x^2 - xy - 5x + 1 = 0$,
$f_2(x, y) \equiv x + 3\log_{10} x - y^2 = 0$

find the positive roots to four significant digits.




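Solution. Plot the graphs of the functions $f_1(x, y) = 0$ and
$f_2(x, y) = 0$ (Fig. 37). The approximate values of the roots that
interest us are

$x_0 = 3.5$,  $y_0 = 2.2$

Fig. 37

To apply the method of iteration, write the system as

$x = \sqrt{\frac{x(y + 5) - 1}{2}} \equiv \varphi_1(x, y)$,
$y = \sqrt{x + 3\log_{10} x} \equiv \varphi_2(x, y)$

We find the partial derivatives:

$\frac{\partial \varphi_1}{\partial x} = \frac{y + 5}{4\sqrt{\frac{x(y + 5) - 1}{2}}}$,  $\frac{\partial \varphi_1}{\partial y} = \frac{x}{4\sqrt{\frac{x(y + 5) - 1}{2}}}$,

$\frac{\partial \varphi_2}{\partial x} = \frac{1 + \frac{3M}{x}}{2\sqrt{x + 3\log_{10} x}}$,  $\frac{\partial \varphi_2}{\partial y} = 0$

where $M = 0.43429$.

Restricting ourselves to the neighbourhood $R\,\{|x - 3.5| \le 0.1,\ |y - 2.2| \le 0.1\}$
we have

$\left|\frac{\partial \varphi_1}{\partial x}\right| \le \frac{2.3 + 5}{4\sqrt{\frac{3.4\,(2.1 + 5) - 1}{2}}} < 0.54$,

$\left|\frac{\partial \varphi_1}{\partial y}\right| \le \frac{3.6}{4\sqrt{\frac{3.4\,(2.1 + 5) - 1}{2}}} < 0.27$,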




























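$\left|\frac{\partial \varphi_2}{\partial x}\right| \le \frac{1 + \frac{3 \cdot 0.43}{3.4}}{2\sqrt{3.4 + 3\log_{10} 3.4}} < 0.42$,

$\left|\frac{\partial \varphi_2}{\partial y}\right| = 0$

whence

$\left|\frac{\partial \varphi_1}{\partial x}\right| + \left|\frac{\partial \varphi_2}{\partial x}\right| \le 0.54 + 0.42 = 0.96 < 1$,  (4)
$\left|\frac{\partial \varphi_1}{\partial y}\right| + \left|\frac{\partial \varphi_2}{\partial y}\right| \le 0.27 + 0 = 0.27 < 1$  (5)

Thus, if the successive approximations $(x_n, y_n)$ do not leave the
region $R$ (this is easy to see as the computations progress), the
iteration process will converge.

The relative proximity of the sum (4) to unity permits assuming
that the iteration process in this case will converge comparatively
slowly. Begin computing the successive approximations
by the formulas

$x_{n+1} = \sqrt{\frac{x_n(y_n + 5) - 1}{2}}$,
$y_{n+1} = \sqrt{x_n + 3\log_{10} x_n}$  $(n = 0, 1, 2, \ldots)$

The corresponding values of the successive approximations are
given in Table 5.

TABLE 5

VALUES OF THE SUCCESSIVE APPROXIMATIONS $x_n$ AND $y_n$

We can thus take $\xi = 3.487$ and $\eta = 2.262$.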
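The successive approximations of this example can be recomputed with a short Python sketch of process (3). The function names and the fixed count of 100 steps are assumptions made here; the starting values $x_0 = 3.5$, $y_0 = 2.2$ and the functions $\varphi_1$, $\varphi_2$ are those of the example.

```python
import math

def phi1(x, y):
    """x = sqrt((x (y + 5) - 1) / 2), obtained from 2x^2 - xy - 5x + 1 = 0."""
    return math.sqrt((x * (y + 5.0) - 1.0) / 2.0)

def phi2(x, y):
    """y = sqrt(x + 3 log10 x), obtained from x + 3 log10 x - y^2 = 0."""
    return math.sqrt(x + 3.0 * math.log10(x))

def simple_iteration(x, y, steps):
    """Process (3): both new values are computed from the previous pair."""
    for _ in range(steps):
        x, y = phi1(x, y), phi2(x, y)
    return x, y

x, y = simple_iteration(3.5, 2.2, 100)
print(round(x, 3), round(y, 3))
```

The tuple assignment matters: it ensures that $y_{n+1}$ is computed from the old $x_n$, as formula (3) prescribes.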


Note. In place of the above-considered process of successive
approximations it is sometimes more convenient to use Seidel’s
process:

$x_{n+1} = \varphi_1(x_n, y_n)$,
$y_{n+1} = \varphi_2(x_{n+1}, y_n)$  $(n = 0, 1, 2, \ldots)$
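Seidel's process differs from process (3) only in that the freshly computed $x_{n+1}$ is used immediately. A minimal Python sketch, reusing the functions $\varphi_1$, $\varphi_2$ and the starting values of the preceding example; the function name `seidel` and the step count are hypothetical.

```python
import math

def seidel(phi1, phi2, x, y, steps):
    """Seidel's process: the fresh x_{n+1} is used at once when computing y_{n+1}."""
    for _ in range(steps):
        x = phi1(x, y)
        y = phi2(x, y)
    return x, y

# The functions phi_1, phi_2 and starting values of the preceding example.
phi1 = lambda x, y: math.sqrt((x * (y + 5.0) - 1.0) / 2.0)
phi2 = lambda x, y: math.sqrt(x + 3.0 * math.log10(x))

x, y = seidel(phi1, phi2, 3.5, 2.2, 100)
print(round(x, 3), round(y, 3))
```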




The method of iteration for general systems is considered in 
Secs. 13.8 to 13.11. 


4.10 NEWTON’S METHOD FOR A SYSTEM 
OF TWO EQUATIONS 
Let $x_n, y_n$ be the approximate roots of the system of equations

$F(x, y) = 0$,  $G(x, y) = 0$  (1)

where $F$ and $G$ are continuously differentiable functions. Putting

$x = x_n + h_n$,  $y = y_n + k_n$

we get

$F(x_n + h_n, y_n + k_n) = 0$,
$G(x_n + h_n, y_n + k_n) = 0$  (2)

Whence, using Taylor’s formula and confining ourselves to linear
terms in $h_n$ and $k_n$, we get

$F(x_n, y_n) + h_n F'_x(x_n, y_n) + k_n F'_y(x_n, y_n) = 0$,
$G(x_n, y_n) + h_n G'_x(x_n, y_n) + k_n G'_y(x_n, y_n) = 0$  (3)

If the Jacobian

$J(x_n, y_n) = \begin{vmatrix} F'_x(x_n, y_n) & F'_y(x_n, y_n) \\ G'_x(x_n, y_n) & G'_y(x_n, y_n) \end{vmatrix} \ne 0$

then from system (3) we have

$h_n = -\frac{1}{J(x_n, y_n)} \begin{vmatrix} F(x_n, y_n) & F'_y(x_n, y_n) \\ G(x_n, y_n) & G'_y(x_n, y_n) \end{vmatrix}$  (4)

$k_n = -\frac{1}{J(x_n, y_n)} \begin{vmatrix} F'_x(x_n, y_n) & F(x_n, y_n) \\ G'_x(x_n, y_n) & G(x_n, y_n) \end{vmatrix}$  (5)

We can thus put

$x_{n+1} = x_n - \frac{1}{J(x_n, y_n)} \begin{vmatrix} F(x_n, y_n) & F'_y(x_n, y_n) \\ G(x_n, y_n) & G'_y(x_n, y_n) \end{vmatrix}$,

$y_{n+1} = y_n - \frac{1}{J(x_n, y_n)} \begin{vmatrix} F'_x(x_n, y_n) & F(x_n, y_n) \\ G'_x(x_n, y_n) & G(x_n, y_n) \end{vmatrix}$  (6)

$(n = 0, 1, 2, \ldots)$

The initial approximations $x_0, y_0$ are determined very roughly.
Example. Find the real roots of the system

$F(x, y) \equiv 2x^3 - y^2 - 1 = 0$,
$G(x, y) \equiv xy^3 - y - 4 = 0$  (1)




Solution. Graphically we obtain crude approximations to the values
of the roots:

$x_0 = 1.2$,  $y_0 = 1.7$

Substituting them into (1) we get

$F(1.2, 1.7) = -0.434$,
$G(1.2, 1.7) = 0.1956$

Compute the Jacobian

$J(x, y) = \begin{vmatrix} 6x^2 & -2y \\ y^3 & 3xy^2 - 1 \end{vmatrix}$

whence

$J(1.2, 1.7) = \begin{vmatrix} 8.64 & -3.40 \\ 4.91 & 9.40 \end{vmatrix} = 97.910$

We compute $h_0$ from formula (4):

$h_0 = -\frac{1}{97.910} \begin{vmatrix} -0.434 & -3.40 \\ 0.1956 & 9.40 \end{vmatrix} = 0.0349$

and from this, using (6), we obtain

$x_1 = 1.2 + 0.0349 = 1.2349$

Compute $k_0$ from formula (5):

$k_0 = -\frac{1}{97.910} \begin{vmatrix} 8.64 & -0.434 \\ 4.91 & 0.1956 \end{vmatrix} = -0.0390$

whence, by (6), we get

$y_1 = 1.7 - 0.0390 = 1.6610$

Repeating this process with the roots obtained, we find
$x_2 = 1.2343$, $y_2 = 1.6615$ and so on.
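The steps of this example follow formulas (4) to (6) directly and can be sketched in Python as follows. The function name `newton_step` and the choice of five steps are assumptions; the partial derivatives are written out by hand for this particular system.

```python
def newton_step(x, y):
    """One step of formulas (4)-(6) for F = 2x^3 - y^2 - 1, G = x y^3 - y - 4."""
    F = 2.0 * x**3 - y**2 - 1.0
    G = x * y**3 - y - 4.0
    Fx, Fy = 6.0 * x**2, -2.0 * y
    Gx, Gy = y**3, 3.0 * x * y**2 - 1.0
    J = Fx * Gy - Fy * Gx            # Jacobian determinant
    h = -(F * Gy - Fy * G) / J       # formula (4)
    k = -(Fx * G - F * Gx) / J       # formula (5)
    return x + h, y + k              # formula (6)

x, y = 1.2, 1.7
for n in range(1, 6):
    x, y = newton_step(x, y)
    print(n, round(x, 4), round(y, 4))
```

A practical implementation would also guard against a vanishing Jacobian; that check is omitted here for brevity.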


Newton’s method for general systems is considered in Secs. 
13.1 to 13.7. 


4.11 NEWTON'S METHOD FOR THE CASE
OF COMPLEX ROOTS
In practical situations (for instance when solving linear differential
equations) it may be necessary to improve the complex roots
of a given equation

$f(z) = 0$  (1)

A method similar to Newton’s may sometimes be used for this
purpose.




Suppose that $f(z)$ ($z = x + iy$, $i^2 = -1$) is an analytic function
in some convex 1) neighbourhood $U$ of its simple isolated zero

$\zeta = \xi + i\eta$  $(f(\zeta) = 0,\ f'(\zeta) \ne 0)$

which, generally speaking, is complex. Let $z_n$ be an approximate
value of the root, which value lies in the neighbourhood $U$, and
let

$z_{n+1} = z_n + \Delta z_n$

be an improved value of the root. Using a Taylor series expansion
at the point $z_n$, assuming $f(z_{n+1}) = 0$ to within $\Delta z_n^2$, we have

$f(z_{n+1}) \approx f(z_n) + \Delta z_n f'(z_n) = 0$

whence

$\Delta z_n = -\frac{f(z_n)}{f'(z_n)}$  (2)

Thus, starting with a value $z_0$, we can, step by step, obtain
subsequent approximations of the root using the formula

$z_{n+1} = z_n - \frac{f(z_n)}{f'(z_n)}$  $(n = 0, 1, 2, \ldots)$  (3)

If $z_n \in U$ $(n = 1, 2, \ldots)$ and the sequence $\{z_n\}$ converges, then
the limit

$\zeta = \lim_{n \to \infty} z_n$

is a root of equation (1). Indeed, passing to the limit in (3) as
$n \to \infty$, we have

$\lim_{n \to \infty} z_{n+1} = \lim_{n \to \infty} z_n - \frac{\lim_{n \to \infty} f(z_n)}{\lim_{n \to \infty} f'(z_n)}$

or

$\zeta = \zeta - \frac{f(\zeta)}{f'(\zeta)}$

Consequently

$f(\zeta) = 0$

To estimate the error in the approximate value $z_n$, assume that

$|f'(z)| \ge m_1 > 0$ for $z \in U$

Then for the given function

$w = f(z)$


1) That is, any two points in $U$ are endpoints of a segment which is also
in $U$.




there exists, in a sufficiently small $R$-neighbourhood of the root $\zeta$,
a single-valued inverse function

$z = f^{-1}(w)$

defined in a neighbourhood $|w| < \rho$, the derivative of which is, as
we know,

$\frac{dz}{dw} = \frac{1}{f'(z)}$  (4)

Assuming that $|f(z_n)| < \rho$ we have

$z_n - \zeta = f^{-1}(f(z_n)) - f^{-1}(f(\zeta)) = \int_{f(\zeta)}^{f(z_n)} \frac{dz}{dw}\,dw = \int_0^{f(z_n)} \frac{dt}{f'(f^{-1}(t))}$  (5)

where $t$ is a point ranging over the rectilinear segment between
the points $f(\zeta) = 0$ and $f(z_n)$ (Fig. 38).
Since $|t| < \rho$, it follows that $|f^{-1}(t) - \zeta| < R$
and, hence,

$|f'(f^{-1}(t))| \ge m_1$

From this, on the basis of (5), we get

$|z_n - \zeta| \le \int_0^{f(z_n)} \frac{|dt|}{|f'(f^{-1}(t))|} \le \frac{|f(z_n)|}{m_1}$  (6)







We give without proof the sufficient conditions for the existence
of a root of equation (1) which follow from the Ostrowski theorem.

Fig. 38

Theorem. If a function $f(z)$ is analytic in a closed $R$-neighbourhood
of a point $z_0$ and the following inequalities hold:

(1) $\left|\frac{1}{f'(z_0)}\right| \le A_0$,
(2) $\left|\frac{f(z_0)}{f'(z_0)}\right| \le B_0 \le \frac{R}{2}$,
(3) $|f''(z)| \le C$ for $|z - z_0| \le R$,
(4) $2A_0B_0C = \mu_0 \le 1$

then equation (1) has a unique root $\zeta$ in the domain $|z - z_0| \le R$
and Newton’s process (3) defined by the initial approximation $z_0$
converges to this root, that is,

$\zeta = \lim_{n \to \infty} z_n$.

The rapidity of convergence of the process is characterized by the
estimate

$|\zeta - z_n| \le B_0 \left(\frac{1}{2}\right)^{n-1} \mu_0^{2^n - 1}$  (7)




Example. Find approximately the smallest, in modulus, roots of
the equation

$f(z) \equiv e^z - 0.2z + 1 = 0$  (8)

Solution. Here

$f'(z) = e^z - 0.2$

Since $f'(z) = 0$ for $z = \bar{z} = \ln 0.2 \approx -1.61$ and

$f(-\infty) = +\infty$,  $f(\bar{z}) > 0$,  $f(+\infty) = +\infty$

it follows that equation (8) does not have any real roots.

For the initial approximation of the desired root $\zeta$ we take the
smallest, in modulus, root $z_0$ of the equation

$e^z + 1 = 0$

whence we can put

$z_0 = \pi i$

The succeeding approximations $z_n$ $(n = 1, 2, 3, \ldots)$ of the root $\zeta$
are successively found from formula (3):

$z_1 = z_0 - \frac{f(z_0)}{f'(z_0)} = \pi i - \frac{-0.2\pi i}{-1.2} = \frac{5\pi i}{6} \approx 2.618i$,

$z_2 = z_1 - \frac{f(z_1)}{f'(z_1)} = \frac{5\pi i}{6} - \frac{0.132 - 0.024i}{-1.068 + 0.5i} = 0.069 + 2.624i$, etc.

The results of the computations to within 0.001 are given in Table 6.

TABLE 6
IMPROVING COMPLEX ROOTS BY NEWTON’S METHOD

n | z_n            | e^{z_n}         | f(z_n)          | f'(z_n)         | Δz_n = -f(z_n)/f'(z_n)
0 | 3.142i         | -1              | -0.628i         | -1.2            | -0.524i
1 | 2.618i         | -0.868 + 0.5i   | 0.132 - 0.024i  | -1.068 + 0.5i   | 0.153 + 0.040i
2 | 0.153 + 2.658i | -1.030 + 0.541i | -0.061 + 0.009i | -1.230 + 0.541i | -0.044 - 0.012i
3 | 0.109 + 2.646i | -0.978 + 0.535i | 0 + 0.006i      | -1.178 + 0.535i | -0.002 + 0.004i
4 | 0.107 + 2.650i | -0.981 + 0.525i | -0.002 - 0.005i | -1.181 + 0.525i | -0.000 - 0.004i
5 | 0.107 + 2.646i | -0.977 + 0.534i | 0.002 + 0.004i  | -1.177 + 0.534i |

To compute $e^z$ for $z = x + iy$ we used the familiar formula

$e^z = e^x(\cos y + i \sin y)$

Assuming

$\zeta \approx z_5 = 0.107 + 2.646i$

we have

$f(z_5) = 0.002 + 0.004i$




Approximately taking it that

$m_1 = |f'(z_5)| \approx 1.3$

we obtain the error on the basis of (6):

$|\zeta - z_5| \le \frac{|f(z_5)|}{m_1} = \frac{0.001\sqrt{20}}{1.3} \approx 0.004$

Since the left member of (8) takes on real values for real $z$, this
equation also has the conjugate root

$\bar{\zeta} \approx 0.107 - 2.646i$

which is equal, in modulus, to the root $\zeta$. Indeed, we have

$f(\bar{\zeta}) = \overline{f(\zeta)} = 0$
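Formula (3) carries over to code unchanged, since Python's built-in complex type and the standard `cmath` module handle the complex arithmetic. The following sketch applies it to equation (8) from the starting value $z_0 = \pi i$; the function names, the count of eight steps, and the use of $|f'(z)|$ at the last iterate as a stand-in for the lower bound $m_1$ are assumptions made for illustration.

```python
import cmath

def f(z):
    """Left member of equation (8)."""
    return cmath.exp(z) - 0.2 * z + 1.0

def df(z):
    """Derivative f'(z) = e^z - 0.2."""
    return cmath.exp(z) - 0.2

z = complex(0.0, cmath.pi)        # z_0 = pi * i, as in the example
for n in range(1, 9):
    z = z - f(z) / df(z)          # formula (3)
    print(n, z)

m1 = abs(df(z))                   # rough stand-in for the lower bound m_1
print(abs(f(z)) / m1)             # error estimate in the spirit of (6)
```

Carried out in full floating-point precision, the iterates settle near $0.107 + 2.646i$, in agreement with the value obtained above.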


Note. An alternative method of solving (1) is to reduce it to a
system of two real equations. Setting

$z = x + iy$

in (1) and isolating the real and imaginary parts of the function
$f(z)$, we get

$f(z) = u(x, y) + iv(x, y) = 0$

where $u$ and $v$ are real functions. From this we find that (1) is
equivalent to the system

$u(x, y) = 0$,
$v(x, y) = 0$  (9)

Improvement of the roots of a system of type (9) is considered in
Secs. 4.9 and 4.10. Note also that this new method is suitable for
the case when the function $f(z)$ is nonanalytic.


REFERENCES FOR CHAPTER 4 


[1] Ya. S. Bezikovich, Approximate Computations, 1949, Chapter VI (in Russian). 

[2] J. B. Scarborough, Numerical Mathematical Analysis, 1955, Chapters IX, X. 

[3] E. T. Whittaker and G. Robinson, The Calculus of Observations, 1944, Chapter VI.

[4] G. M. Fikhtengolts, Course of Differential and Integral Calculus, 1957, Vol. 1,
Chapter IV (in Russian).

[5] G. P. Tolstov, Course of Mathematical Analysis, 1954, Vol. 1, Chapter VII 
(in Russian). 

[6] A. O. Gelfond, Calculus of Finite Differences, 1952, Chapter V (in Russian). 

[7] D. A. Ventsel, E. S. Ventsel, Elements of the Theory of Approximate Com- 
putations, 1949, Chapter 3, Sec. 4 (in Russian). 

[8] L. V. Kantorovich, On Newton's Method, 1949 (in Russian). 




Chapter 5 


SPECIAL TECHNIQUES FOR APPROXIMATE 
SOLUTION OF ALGEBRAIC EQUATIONS 


5.1 GENERAL PROPERTIES OF ALGEBRAIC EQUATIONS


Consider the algebraic equation of degree $n$ $(n \ge 1)$

$P(x) \equiv a_0x^n + a_1x^{n-1} + \ldots + a_n = 0$  (1)

where the coefficients $a_0, a_1, \ldots, a_n$ are real numbers and
$a_0 \ne 0$.

The variable x will be considered complex in the general case. 

Fundamental theorem of algebra. An algebraic equation of the
nth degree, (1), [and, hence, also the polynomial $P(x)$] has exactly $n$
roots, real or complex, provided that each root is counted according
to its multiplicity [1], [2].

We then say that a root $\xi$ of equation (1) has multiplicity $s$
(that is, $\xi$ is an $s$-fold root) if

$P(\xi) = P'(\xi) = \ldots = P^{(s-1)}(\xi) = 0$,
$P^{(s)}(\xi) \ne 0$  (2)

The complex roots of equation (1) have the property of appearing
in complex conjugate pairs.

Theorem 1. If the coefficients of the algebraic equation (1) are real,
then the complex roots of this equation are complex conjugate in pairs,
that is, if $\xi = \alpha + i\beta$ ($\alpha$, $\beta$ are real) is a root of (1), of multiplicity $s$,
then the number $\bar{\xi} = \alpha - i\beta$ is also a root of that equation and has
the same multiplicity $s$.

Note that the moduli of these roots are the same:

$|\xi| = |\bar{\xi}| = \sqrt{\alpha^2 + \beta^2}$

Corollary. An algebraic equation of odd degree with real coefficients
has at least one real root.

It is easy to give a rough estimate of the moduli of the roots 
of equation (1). 




Theorem 2. Suppose

$A = \max\{|a_1|, |a_2|, \ldots, |a_n|\}$

where $a_k$ are the coefficients of (1).
Then the moduli of all the roots $x_k$ $(k = 1, \ldots, n)$ of (1) satisfy
the inequality

$|x_k| < 1 + \frac{A}{|a_0|} = R$  (3)

That is, in the complex plane $\xi O \eta$ ($x = \xi + i\eta$) the roots of this
equation are located inside the circle

$|x| < 1 + \frac{A}{|a_0|} = R$

(Fig. 39).

Fig. 39

Proof. Setting $|x| > 1$, we have from formula (1)

$|P(x)| \ge |a_0x^n| - (|a_1x^{n-1}| + |a_2x^{n-2}| + \ldots + |a_n|) \ge$
$\ge |a_0|\,|x|^n - A\,(|x|^{n-1} + |x|^{n-2} + \ldots + 1) =$
$= |a_0|\,|x|^n - A\,\frac{|x|^n - 1}{|x| - 1} > \left(|a_0| - \frac{A}{|x| - 1}\right)|x|^n$

Whence, if

$|a_0| - \frac{A}{|x| - 1} \ge 0$

that is, if

$|x| \ge 1 + \frac{A}{|a_0|}$  (4)

we find that

$|P(x)| > 0$

Thus the values of $x$ which satisfy (4) are definitely not the
roots of equation (1). Hence, all the roots $x_k$ of (1) satisfy the
reversed inequality

$|x_k| < 1 + \frac{A}{|a_0|}$


Corollary. Let $a_n \ne 0$ and

$B = \max\{|a_0|, |a_1|, \ldots, |a_{n-1}|\}$

Then all the roots $x_k$ $(k = 1, 2, \ldots, n)$ of equation (1) satisfy the
inequality

$|x_k| > \frac{1}{1 + \frac{B}{|a_n|}} = r$  (5)





164 Ch. 5. Special Techn’s for Approx. Solut. ef Algebr. Eqs. 


that is, the roots of (1) are located in the annulus

$r < |x| < R$

(Fig. 40).
True enough, for, setting

$x = \frac{1}{y}$

we have

$P(x) = \frac{1}{y^n}\,Q(y)$

where

$Q(y) = a_ny^n + a_{n-1}y^{n-1} + \ldots + a_0$

Fig. 40

The roots $y_k = \frac{1}{x_k}$ $(k = 1, \ldots, n)$ of
the polynomial $Q(y)$ satisfy, by virtue of our theorem, the inequality

$|y_k| = \frac{1}{|x_k|} < 1 + \frac{B}{|a_n|}$

whence

$|x_k| > \frac{1}{1 + \frac{B}{|a_n|}} = r$  $(k = 1, \ldots, n)$

Note. The numbers $r$ and $R$ are the lower and upper bounds,
respectively, of the positive roots of equation (1).

Similarly, the numbers $-R$ and $-r$ serve as the lower and
upper bounds, respectively, of the negative roots of equation (1).
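The bounds $r$ and $R$ require nothing beyond the coefficients, as the following Python sketch shows. The function name `root_annulus` and the test polynomial $x^3 - 3x + 1$ are assumptions chosen for illustration.

```python
def root_annulus(coeffs):
    """Return (r, R) from Theorem 2 and its corollary.

    coeffs = [a_0, a_1, ..., a_n] with a_0 != 0 and a_n != 0.
    """
    a0, an = abs(coeffs[0]), abs(coeffs[-1])
    A = max(abs(c) for c in coeffs[1:])     # max of |a_1|, ..., |a_n|
    B = max(abs(c) for c in coeffs[:-1])    # max of |a_0|, ..., |a_{n-1}|
    R = 1.0 + A / a0
    r = 1.0 / (1.0 + B / an)
    return r, R

# Illustrative polynomial: x^3 - 3x + 1.
r, R = root_annulus([1, 0, -3, 1])
print(r, R)
```

These are crude bounds: they locate the annulus containing all the roots but say nothing about where in it the roots lie.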

If

$x_1, x_2, \ldots, x_n$

are the roots of (1), then the following expansion holds for the
left member:

$P(x) = a_0(x - x_1)(x - x_2)\ldots(x - x_n)$  (6)

From this, multiplying out the binomials in (6) and equating the
coefficients of identical powers of $x$ in the left and right members
of (6), we get relations between the roots and the coefficients
of the algebraic equation:

$x_1 + x_2 + \ldots + x_n = -\frac{a_1}{a_0}$,
$x_1x_2 + x_1x_3 + \ldots + x_{n-1}x_n = \frac{a_2}{a_0}$,
. . . . . . . . . . . . . . . . . . . .
$x_1x_2\ldots x_n = (-1)^n\frac{a_n}{a_0}$  (7)

The left members of (7) are the sums of combinations of the roots
of equation (1) taken one at a time, two at a time, etc.


Example 1. The roots x,, X, x, of the cubic equation 
x? + px? + qx+r=0 
satisfy the conditions 
ntr +x =— p, 
XXa T XX, + XX= q, 
X4X,X,=— Tr 
If multiplicities of the roots are taken into account, the expansion
(6) assumes the form
$P(x) = a_0 (x - x_1)^{\alpha_1} (x - x_2)^{\alpha_2} \ldots (x - x_m)^{\alpha_m}$
where $x_1, x_2, \ldots, x_m$ ($m \le n$) are distinct roots of equation (1)
and $\alpha_1, \alpha_2, \ldots, \alpha_m$ are their multiplicities, and
$\alpha_1 + \alpha_2 + \ldots + \alpha_m = n$
The derivative P'(x) is expressed as follows:
$P'(x) = a_0 (x - x_1)^{\alpha_1 - 1} (x - x_2)^{\alpha_2 - 1} \ldots (x - x_m)^{\alpha_m - 1} Q(x)$
where Q(x) is a polynomial such that
$Q(x_k) \ne 0$ for k = 1, 2, ..., m
For this reason, the polynomial
$R(x) = a_0 (x - x_1)^{\alpha_1 - 1} (x - x_2)^{\alpha_2 - 1} \ldots (x - x_m)^{\alpha_m - 1}$
is the greatest common divisor of the polynomial P(x) and its
derivative P'(x). It will be recalled that R(x) can be found by
Euclid's algorithm [1]. Forming the quotient
$f(x) = \frac{P(x)}{R(x)}$
we get the polynomial
$f(x) = A_0 x^m + A_1 x^{m-1} + \ldots + A_m$ (8)
with real coefficients $A_0, A_1, \ldots, A_m$, the roots of which
$x_1, x_2, \ldots, x_m$ are distinct.

Thus, the solution of an algebraic equation with multiple roots
reduces to that of an algebraic equation of lower degree with
distinct roots.
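
The reduction to distinct roots can be sketched in a few lines of Python. This is only an illustration under stated assumptions: coefficients are given highest power first, exact rational arithmetic (`fractions.Fraction`) is used so that Euclid's algorithm terminates cleanly, and the helper names (`poly_divmod`, `poly_gcd`, `distinct_root_poly`) are hypothetical. The greatest common divisor is normalised to be monic here, so the quotient keeps the leading coefficient of P(x).

```python
from fractions import Fraction


def trim(p):
    """Drop leading zero coefficients (highest power first)."""
    i = 0
    while i < len(p) - 1 and p[i] == 0:
        i += 1
    return p[i:]


def poly_divmod(num, den):
    """Quotient and remainder of num / den by long division."""
    num, den = trim(list(num)), trim(list(den))
    quot = []
    while len(num) >= len(den):
        factor = num[0] / den[0]
        quot.append(factor)
        for i, d in enumerate(den):
            num[i] -= factor * d
        num.pop(0)  # the leading term has been cancelled
    return (quot or [Fraction(0)]), trim(num or [Fraction(0)])


def derivative(p):
    n = len(p) - 1
    return [c * (n - i) for i, c in enumerate(p[:-1])] or [Fraction(0)]


def poly_gcd(a, b):
    """Monic greatest common divisor by Euclid's algorithm."""
    a, b = trim(list(a)), trim(list(b))
    while any(c != 0 for c in b):
        a, b = b, poly_divmod(a, b)[1]
    return [c / a[0] for c in a]


def distinct_root_poly(coeffs):
    """Return P / gcd(P, P'), whose roots are the distinct roots of P."""
    p = [Fraction(c) for c in coeffs]
    return poly_divmod(p, poly_gcd(p, derivative(p)))[0]


# (x - 1)^2 (x + 2) = x^3 - 3x + 2 reduces to (x - 1)(x + 2) = x^2 + x - 2
print(distinct_root_poly([1, 0, -3, 2]))
```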

The total number of roots $x_1, x_2, \ldots, x_N$ of the equation

$P(x) = 0$

located in the complex plane inside a simple closed contour $\Gamma$
(Fig. 41) may be determined on the basis of the principle of the
argument [4] which consists in the following: if a polynomial
P(x) has no roots on a closed contour $\Gamma$, then the number of roots
N of the polynomial inside the contour $\Gamma$ is exactly equal to the
variation of Arg P(x) in a positive traversal of $\Gamma$, divided by
$2\pi$; that is,

$N = \frac{1}{2\pi} \Delta_\Gamma \operatorname{Arg} P(x)$

Each root is counted according to its multiplicity.
If the equation of the contour $\Gamma$ is
$x = \xi(t) + i\eta(t)$ ($0 \le t \le T$)

(t a parameter), then, to determine the number N, one constructs
in the XY-plane a curve
$X = X(t), \quad Y = Y(t)$ ($0 \le t \le T$) (K)
where

$P(x) = P(\xi(t) + i\eta(t)) = X(t) + iY(t)$
and X(t), Y(t) are real functions; then one counts the number
of circuits N that the curve K makes about the origin.

Fig. 41


Example 2. Determine the number of roots of the equation
$P(x) = x^3 - 3x + 1 = 0$ (9)
contained in the circle |x| < 2.

Solution. Putting
$x = 2(\cos t + i \sin t)$
we have
$P(x) = 8(\cos t + i \sin t)^3 - 6(\cos t + i \sin t) + 1 =$
$= (8 \cos 3t - 6 \cos t + 1) + i(8 \sin 3t - 6 \sin t)$

whence
$X = 8 \cos 3t - 6 \cos t + 1,$
$Y = 8 \sin 3t - 6 \sin t$ (K)
TABLE 7

t | 0 | π/6   | π/3   | π/2 | 2π/3  | 5π/6 | π
X | 3 | -4.20 | -10   | 1   | 12    | 6.20 | -1
Y | 0 | 5     | -5.20 | -14 | -5.20 | 5    | 0



Plotting the curve K (see Table 7), it is easy to see that the 
curve circles the origin three times (Fig. 42). Therefore, N=3 and 
so equation (9) has three roots inside the circle |x| < 2. 
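
The count of circuits can also be checked numerically. The sketch below is an illustration rather than the procedure of the text: it walks around the circle |x| = radius in small steps and accumulates the change in Arg P(x). The function name `count_roots_in_circle` and the step count are assumptions, and the result is reliable only if P(x) has no roots on or very near the contour.

```python
import cmath
import math


def count_roots_in_circle(coeffs, radius, steps=2000):
    """Number of roots of P inside |x| < radius by the principle of the argument.

    coeffs are given highest power first; P must not vanish on the circle.
    """
    def P(x):
        value = 0
        for c in coeffs:
            value = value * x + c  # Horner's scheme
        return value

    total = 0.0
    previous = P(complex(radius, 0.0))
    for j in range(1, steps + 1):
        t = 2 * math.pi * j / steps
        current = P(radius * cmath.exp(1j * t))
        # increment of the argument between neighbouring points of the curve K
        total += cmath.phase(current / previous)
        previous = current
    return round(total / (2 * math.pi))


print(count_roots_in_circle([1, 0, -3, 1], 2))
```

For equation (9) this reproduces N = 3 for the circle |x| < 2; a smaller radius such as 1 encloses only one of the roots.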


Fig. 42





5.2 THE BOUNDS OF REAL ROOTS
OF ALGEBRAIC EQUATIONS


In this section we consider polynomials of the type
$P(x) = a_0 x^n + a_1 x^{n-1} + \ldots + a_n$ ($a_0 \ne 0$) (1)

with real coefficients $a_0, a_1, \ldots, a_n$. Our aim here is to establish
the limits, as narrow as possible, for the positive and negative
roots $x_1, x_2, \ldots, x_m$ ($1 \le m \le n$) of the equation

$P(x) = 0$ (2)


The problem of the existence of these roots is not touched on here.
It will be noted that one can restrict oneself to finding the upper
limit R of only the positive roots of equations of type (2). Indeed,
together with (2) let us consider the auxiliary algebraic equations

$P_1(x) \equiv x^n P\left(\frac{1}{x}\right) = 0,$
$P_2(x) \equiv P(-x) = 0,$
$P_3(x) \equiv x^n P\left(-\frac{1}{x}\right) = 0$
and let the upper bounds of their positive roots be $R_1$, $R_2$ and $R_3$
respectively. Then the number $\frac{1}{R_1}$ is clearly a lower bound of the
positive roots of equation (2), that is, all positive roots $x^+$ of
this equation, if they exist, satisfy the inequality
$\frac{1}{R_1} \le x^+ \le R$

Similarly, the numbers $-R_2$ and $-\frac{1}{R_3}$ are, respectively, lower and
upper bounds of the negative roots of (2), that is, all negative
roots $x^-$ of this equation, if there are any, satisfy the inequality

$-R_2 \le x^- \le -\frac{1}{R_3}$

We now give some simple techniques (not all of which are pro- 

vided with proof) for finding the upper bound R of positive roots 
of equation (2). 

Lagrange's theorem. Suppose $a_0 > 0$ and $a_k$ ($k \ge 1$) is the first
of the negative coefficients¹⁾ of the polynomial P(x). Then for the
upper bound of the positive roots of (2) we can take the number

$R = 1 + \sqrt[k]{\frac{B}{a_0}}$ (3)


where B is the largest absolute value of the negative coefficients of 
the polynomial P (x). 


Proof. Set x > 1. If in P(x) each of the nonnegative coefficients
$a_1, \ldots, a_{k-1}$ is replaced by zero, and each of the remaining
coefficients $a_k, a_{k+1}, \ldots, a_n$ is replaced by a negative number -B,
then polynomial (1) can only diminish its value and we will have
the inequality

$P(x) \ge a_0 x^n - B(x^{n-k} + x^{n-k-1} + \ldots + 1) = a_0 x^n - B \frac{x^{n-k+1} - 1}{x - 1}$

Whence for x > 1 we have

$P(x) > a_0 x^n - \frac{B}{x - 1} x^{n-k+1} = \frac{x^{n-k+1}}{x - 1} [a_0 x^{k-1}(x - 1) - B] >$

$> \frac{x^{n-k+1}}{x - 1} [a_0 (x - 1)^k - B]$
Consequently for

$x \ge 1 + \sqrt[k]{\frac{B}{a_0}} = R$
we have
$P(x) > 0$

Thus all the positive roots $x^+$ of equation (2) satisfy the
inequality
$x^+ < R$

¹⁾ If there is no such coefficient, i.e. all coefficients of P(x) are
nonnegative, then P(x) has no positive roots.

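Lagrange's bound and the three auxiliary equations of this section translate directly into operations on the coefficient list. The following Python sketch is illustrative only: coefficients are given highest power first, the constant term is assumed to be nonzero (so that the reversed polynomials keep their degree), and the function names are hypothetical.

```python
def lagrange_upper_bound(coeffs):
    """R = 1 + (B / a_0)**(1/k), or None if there are no negative coefficients."""
    a = list(coeffs)
    if a[0] < 0:                      # normalise so that a_0 > 0
        a = [-c for c in a]
    negative = [(i, -c) for i, c in enumerate(a) if c < 0]
    if not negative:
        return None                   # all coefficients nonnegative: no positive roots
    k = negative[0][0]                # index of the first negative coefficient
    B = max(size for _, size in negative)
    return 1 + (B / a[0]) ** (1.0 / k)


def real_root_bounds(coeffs):
    """Intervals containing the positive and the negative roots (None if there are none)."""
    a = list(coeffs)
    n = len(a) - 1
    p_minus = [c * (-1) ** (n - i) for i, c in enumerate(a)]  # P(-x)
    R = lagrange_upper_bound(a)                  # P(x)
    R1 = lagrange_upper_bound(a[::-1])           # x^n P(1/x)
    R2 = lagrange_upper_bound(p_minus)           # P(-x)
    R3 = lagrange_upper_bound(p_minus[::-1])     # x^n P(-1/x)
    positive = (1 / R1, R) if R and R1 else None
    negative = (-R2, -1 / R3) if R2 and R3 else None
    return {"positive": positive, "negative": negative}


# 2x^5 - 100x^2 + 2x - 1 = 0
print(real_root_bounds([2, 0, 0, -100, 2, -1]))
```
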

5.3 THE METHOD OF ALTERNATING SUMS 


The idea of Lagrange's method may be generalized in the following
manner: let a polynomial P(x) be arranged in descending
powers of the variable x, its leading coefficient being $a_0 > 0$.
Represent P(x) in the form of an alternating sum:

$P(x) = Q_1(x) - Q_2(x) + Q_3(x) - Q_4(x) + \ldots + Q_{2m-1}(x) - Q_{2m}(x)$

where $Q_1(x)$ is the sum of the successive terms of the polynomial
P(x) with positive coefficients beginning with $a_0 x^n$ and $-Q_2(x)$ is
the sum of the successive terms of the polynomial P(x) with negative
coefficients, which terms adjoin the terms of the first sum, and so
on; the last summand, $-Q_{2m}(x)$, either consists of terms with
negative coefficients or is identically zero.
Denote by $c_j$ (j = 1, 2, ..., m) positive numbers such that
$Q_{2j-1}(c_j) - Q_{2j}(c_j) \ge 0$ (1)
(j = 1, 2, ..., m). Then for the upper bound of the positive roots
of equation (2) of Sec. 5.2 we can take the number
$R = \max (c_1, c_2, \ldots, c_m)$ (2)


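True enough, set
$Q_{2j-1}(x) - Q_{2j}(x) = b_1^{(j)} x^{r+p+q-1} + b_2^{(j)} x^{r+p+q-2} + \ldots + b_p^{(j)} x^{r+q} -$

$- b_{p+1}^{(j)} x^{r+q-1} - b_{p+2}^{(j)} x^{r+q-2} - \ldots - b_{p+q}^{(j)} x^{r}$
where
$b_s^{(j)} \ge 0$ (s = 1, 2, ..., p+q)
and

$b_1^{(j)} > 0$ (j = 1, 2, ..., m)
Assuming x > 0, we have
$Q_{2j-1}(x) - Q_{2j}(x) = x^{r+q} \left[ b_1^{(j)} x^{p-1} + b_2^{(j)} x^{p-2} + \ldots + b_p^{(j)} - \frac{b_{p+1}^{(j)}}{x} - \ldots - \frac{b_{p+q}^{(j)}}{x^q} \right]$ (3)
From (3) it is evident that the bracketed factor increases with
increasing x, so that once the function $Q_{2j-1}(x) - Q_{2j}(x)$
(j = 1, 2, ..., m) is nonnegative it increases with increasing x.
Hence, for $x > c_j > 0$, we have

$Q_{2j-1}(x) - Q_{2j}(x) > Q_{2j-1}(c_j) - Q_{2j}(c_j) \ge 0$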


whence for x > R we get

$P(x) = \sum_{j=1}^{m} [Q_{2j-1}(x) - Q_{2j}(x)] > 0$

Thus all the positive roots $x^+$ of equation (2) of Sec. 5.2 satisfy
the condition

$x^+ \le R$
Example. Determine the bounds of the real roots of the equation
$2x^5 - 100x^2 + 2x - 1 = 0$ (4)

Solution. Here $a_0 = 2$ and A = max (100, 2, 1) = 100. Therefore
the upper bound R of the positive roots of equation (4) is, by
Theorem 2 of Sec. 5.1,

$R = 1 + \frac{A}{a_0} = 1 + \frac{100}{2} = 51$

Applying Lagrange's theorem and taking into account that
$a_k = a_3 = -100$ and B = max (100, 1) = 100

we will get a much better estimate for the upper bound of the
positive roots:
$R = 1 + \sqrt[3]{\frac{100}{2}} = 1 + \sqrt[3]{50} \approx 4.7$

Finally, using the method of alternating sums, we find
$2x^5 - 100x^2 = 2x^2 (x^3 - 50) > 0$
for $x > \sqrt[3]{50}$ (say, for $x \ge 3.7$) and
$2x - 1 = 2\left(x - \frac{1}{2}\right) > 0$ for x > 0.5
Hence we can take
R = max (3.7, 0.5) = 3.7

To determine the lower bound r of the positive roots of (4), set
$x = \frac{1}{y}$

Then equation (4) takes the form
$y^5 - 2y^4 + 100y^3 - 2 = 0$
We successively get
$y^5 - 2y^4 = y^4 (y - 2) \ge 0$ for $y \ge 2$
and
$100y^3 - 2 = 100(y^3 - 0.02) > 0$ for $y \ge 0.3$

Consequently
$R_1 = \max (2, 0.3) = 2$
and
$r = \frac{1}{R_1} = 0.5$
To find the bound of the negative roots in equation (4) put
x = -z
Then
$2z^5 + 100z^2 + 2z + 1 = 0$ (4')


Since the coefficients of equation (4’) are positive or zero, this 
equation does not have any positive roots and so the given equa- 
tion (4) does not have any negative roots. 
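
The method of alternating sums can be mechanised as follows. This is a minimal sketch, not the book's procedure: it assumes the leading coefficient is positive, splits the nonzero terms into groups $Q_{2j-1} - Q_{2j}$, and finds for each group (by bisection, an assumption of this sketch) a number $c_j$ at which the group is nonnegative. The name `alternating_sum_bound` is hypothetical.

```python
def alternating_sum_bound(coeffs, tol=1e-6):
    """Upper bound for the positive roots; coeffs highest power first, coeffs[0] > 0."""
    n = len(coeffs) - 1
    # Split the nonzero terms into groups: a run of positive terms
    # followed by the adjoining run of negative terms.
    groups, current, seen_negative = [], [], False
    for i, c in enumerate(coeffs):
        if c == 0:
            continue
        if c > 0 and seen_negative:
            groups.append(current)
            current, seen_negative = [], False
        if c < 0:
            seen_negative = True
        current.append((c, n - i))
    groups.append(current)

    def value(group, x):
        return sum(c * x ** e for c, e in group)

    bound = 0.0
    for group in groups:
        if all(c > 0 for c, _ in group):
            continue                      # last Q_{2m} is identically zero
        hi = 1.0
        while value(group, hi) < 0:       # find a point where the group is nonnegative
            hi *= 2
        lo = 0.0
        while hi - lo > tol:              # shrink towards the smallest admissible c_j
            mid = (lo + hi) / 2
            if value(group, mid) >= 0:
                hi = mid
            else:
                lo = mid
        bound = max(bound, hi)
    return bound


# Equation (4): the groups are 2x^5 - 100x^2 and 2x - 1
print(alternating_sum_bound([2, 0, 0, -100, 2, -1]))
```

For equation (4) the result is close to $\sqrt[3]{50} \approx 3.68$, in agreement with the rounded value 3.7 used above.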


5.4 NEWTON'S METHOD 
Theorem (Newton). If for x = c > 0 the polynomial P(x) and all
its derivatives $P'(x), P''(x), \ldots, P^{(n)}(x)$ are nonnegative:

$P^{(k)}(c) \ge 0$ (k = 0, 1, 2, ..., n) (1)
and $P^{(n)}(c) = n! \, a_0 > 0$, then R = c can be taken for the upper bound
of the positive roots of the equation

$P(x) = 0$ (2)

Proof. With x > c, taking into account inequality (1), we have,
on the basis of Taylor's formula,

$P(x) = P(c) + P'(c)(x - c) + \ldots + \frac{P^{(n)}(c)}{n!} (x - c)^n > 0$

Hence, all positive roots $x^+$ of equation (2) satisfy the inequality
$x^+ \le c$
Note. In practical applications of Newton's theorem, the
trial-and-error method is used (say, via the Horner scheme) to find
a monotonic nondecreasing sequence of positive numbers

$c_1 \le c_2 \le \ldots \le c_{n-1} \le c_n$
for which the following inequalities hold true:
$P^{(n-1)}(c_1) > 0,$
$P^{(n-2)}(c_2) > 0,$
$\ldots$
$P'(c_{n-1}) > 0,$
$P(c_n) > 0$

Such numbers definitely exist since for $a_0 > 0$ we have
$P^{(m)}(x) \to +\infty$ (m = 0, 1, 2, ..., n-1)
as $x \to +\infty$. We can finally take $c = c_n$.

Indeed, since

$P^{(n)}(x) = n! \, a_0 > 0$

the function $P^{(n-1)}(x)$ is an increasing function and, hence, for
$x \ge c_1$ we have
$P^{(n-1)}(x) \ge P^{(n-1)}(c_1) > 0$
From this inequality it follows that the function $P^{(n-2)}(x)$ is
increasing in the interval $[c_1, +\infty)$ and therefore we get, for
$x \ge c_2 \ge c_1$,
$P^{(n-2)}(x) \ge P^{(n-2)}(c_2) > 0$
Reasoning in this fashion consistently, we finally are assured that
P(x) is an increasing function in the interval $[c_{n-1}, +\infty)$ and
hence for $x \ge c_n \ge c_{n-1}$ we have
$P(x) \ge P(c_n) > 0$
which means that $x^+ < c_n$.
Example. Consider the equation (given in the example of Sec. 5.3)
$P(x) = 2x^5 - 100x^2 + 2x - 1 = 0$
Here
$P'(x) = 10x^4 - 200x + 2,$
$P''(x) = 40x^3 - 200,$
$P'''(x) = 120x^2,$
$P^{IV}(x) = 240x,$
$P^{V}(x) = 240$
Obviously $P'''(x) > 0$, $P^{IV}(x) > 0$, $P^{V}(x) > 0$ for x > 0.
We have
$P''(x) = 40(x^3 - 5) > 0$ for $x \ge 2$

We assume $c_1 = c_2 = c_3 = 2$. Since
$P'(2) = 10 \cdot 16 - 200 \cdot 2 + 2 < 0$
we determine the sign of the number
$P'(3) = 10 \cdot 81 - 200 \cdot 3 + 2 > 0$
We can take $c_4 = 3$. We have
$P(3) = 2 \cdot 243 - 100 \cdot 9 + 2 \cdot 3 - 1 < 0$

and so we compute
$P(4) = 2 \cdot 1024 - 100 \cdot 16 + 2 \cdot 4 - 1 > 0$

Thus, $c_5 = 4$, and the upper bound of the positive roots of the
given equation is

R = 4

The estimate via Newton's method is more exact than that given
above on the basis of Lagrange's method, but less so than the
estimate obtained by the method of alternating sums (see example of
Sec. 5.3).
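
The trial-and-error search described in the note can be sketched in Python. The sketch assumes a positive leading coefficient and tries the numbers start, start + step, ... in turn; the starting point, the step and the function names are assumptions made for illustration.

```python
def derivative_list(coeffs):
    """Coefficient lists of P, P', ..., P^(n), highest power first."""
    out = [list(coeffs)]
    while len(out[-1]) > 1:
        p = out[-1]
        n = len(p) - 1
        out.append([c * (n - i) for i, c in enumerate(p[:-1])])
    return out


def horner(p, x):
    value = 0
    for c in p:
        value = value * x + c
    return value


def newton_upper_bound(coeffs, start=1, step=1):
    """Find c with P^(n-1)(c_1) > 0, ..., P'(c_{n-1}) > 0, P(c_n) > 0 and c_1 <= ... <= c_n."""
    ders = derivative_list(coeffs)           # ders[k] holds P^(k)
    c = start
    for k in range(len(ders) - 2, -1, -1):   # P^(n-1), ..., P', P
        while horner(ders[k], c) <= 0:
            c += step                        # the sequence c_j never decreases
    return c


# 2x^5 - 100x^2 + 2x - 1 = 0
print(newton_upper_bound([2, 0, 0, -100, 2, -1]))
```

With integer trial values this retraces the example: the second derivative forces c = 2, the first derivative c = 3, and P itself c = 4.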


5.5 THE NUMBER OF REAL ROOTS OF A POLYNOMIAL 


After the bounds have been established of the positive and 
negative roots of an algebraic equation 


P (x)=0 (1) 


where P(x) is a given polynomial, the question arises as to the 
number of real roots of the given equation on some known inter- 
val (a, b). 
A general picture concerning the number of real roots of equation
(1) on the interval (a, b) is given by the graph of the function
y = P(x) (Fig. 43), where the roots $x_1, x_2, x_3$ are found as the
abscissas of the points of intersection of the graph with the x-axis.

We note some simple properties of a polynomial.

(1) If P(a) P(b) < 0, then on the interval (a, b) there is an odd
number of roots of P(x), counting multiplicities.

(2) If P(a) P(b) > 0, then on the interval (a, b) there are no roots
of the polynomial P(x) or there is an even number of such roots.

Fig. 43

The question of the number of real roots of an algebraic equation
on a given interval is solved completely by the Sturm method [1], [2].

First let us introduce the notion of the number of sign-changes
in a set of numbers.


Definition. Suppose we have an ordered finite set of real numbers
different from zero:

$c_1, c_2, \ldots, c_n$ ($n \ge 2$) (2)

We say that there is a change of sign for a pair of two successive
elements $c_k, c_{k+1}$ of (2) if these elements have opposite signs, that is,

$c_k c_{k+1} < 0$
and there is no change of sign if the signs are the same; thus
$c_k c_{k+1} > 0$

The total number of changes of sign in all pairs of successive
elements $c_k, c_{k+1}$ (k = 1, 2, ..., n-1) of (2) is called the number
of sign-changes (variations of sign) in (2).
For a given polynomial P(x), we form the Sturm sequence
$P(x), P_1(x), P_2(x), \ldots, P_m(x)$ (3)

where $P_1(x) = P'(x)$, $P_2(x)$ is the remainder, with reversed sign, left
after the division of the polynomial P(x) by $P_1(x)$, $P_3(x)$ is the
remainder, with reversed sign, after the division of the polynomial
$P_1(x)$ by $P_2(x)$, and so forth. The polynomials $P_k(x)$ (k = 2, ..., m)
may be found with the aid of a slightly modified Euclidean
algorithm; if the polynomial P(x) does not have any multiple roots,
then the last element $P_m(x)$ in the Sturm sequence is a nonzero
real number. Note that the elements in a Sturm sequence can be
computed to within a positive numerical factor.

Denote by N(c) the number of sign-changes in a Sturm sequence
for x = c provided that the zero elements of the sequence
have been crossed out.

Sturm's theorem. If a polynomial P(x) does not have multiple roots
and $P(a) \ne 0$, $P(b) \ne 0$, then the number of its real roots N(a, b)
on the interval a < x < b is exactly equal to the number of lost
sign-changes in the Sturm sequence of the polynomial P(x) when
going from x = a to x = b, that is,

$N(a, b) = N(a) - N(b)$ (4)

Corollary 1. If $P(0) \ne 0$, then the number $N_+$ of positive and
the number $N_-$ of negative roots of the polynomial P(x) are
respectively
$N_+ = N(0) - N(+\infty)$
and
$N_- = N(-\infty) - N(0)$

Corollary 2. For all roots of a polynomial P(x) of degree n
to be real, in the absence of multiple roots, it is necessary and
sufficient that the following condition hold:
$N(-\infty) - N(+\infty) = n$

Thus, if

$P(x) = a_0 x^n + a_1 x^{n-1} + \ldots + a_n$

where $a_0 > 0$, then all roots of the equation P(x) = 0 will be real
if and only if [1]: (1) the Sturm sequence has a maximum number
of elements n + 1, that is m = n, and (2) the inequalities
$P_k(+\infty) > 0$ (k = 1, 2, ..., n) hold true; thus the leading
coefficients of all Sturm functions $P_k(x)$ must be positive.


Example. Determine the number of positive and negative roots of
the equation
$x^4 - 4x + 1 = 0$ (5)

Solution. The Sturm sequence is of the form
$P(x) = x^4 - 4x + 1,$
$P_1(x) = x^3 - 1,$
$P_2(x) = 3x - 1,$
$P_3(x) = 1$
whence
$N(-\infty) = 2, \quad N(0) = 2, \quad N(+\infty) = 0$
Hence, equation (5) has
$N_+ = 2 - 0 = 2$
positive roots and
$N_- = 2 - 2 = 0$
negative roots. And so two roots of (5) are complex roots.
We can isolate the roots of algebraic equations via the Sturm
sequence by partitioning the interval (a, b) containing all real
roots of the equation into a finite number of subintervals $(\alpha, \beta)$
such that
$N(\alpha) - N(\beta) = 1$
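
A Sturm sequence and the counts N(c) can be computed exactly with rational arithmetic. The sketch below is one possible implementation under stated assumptions: coefficients are given highest power first, the polynomial has no multiple roots, and the function names are hypothetical. The elements are not rescaled, which is harmless since only their signs matter.

```python
import math
from fractions import Fraction


def negated_remainder(num, den):
    """Remainder of num / den with reversed sign (leading zeros removed)."""
    num = list(num)
    while len(num) >= len(den):
        factor = num[0] / den[0]
        for i, d in enumerate(den):
            num[i] -= factor * d
        num.pop(0)
    while len(num) > 1 and num[0] == 0:
        num.pop(0)
    return [-c for c in num] or [Fraction(0)]


def sturm_sequence(coeffs):
    p = [Fraction(c) for c in coeffs]
    n = len(p) - 1
    seq = [p, [c * (n - i) for i, c in enumerate(p[:-1])]]   # P and P'
    while len(seq[-1]) > 1:
        r = negated_remainder(seq[-2], seq[-1])
        if all(c == 0 for c in r):
            break                      # multiple roots: the theorem does not apply
        seq.append(r)
    return seq


def sign_changes_at(seq, x):
    """N(x); x may also be float('inf') or float('-inf')."""
    values = []
    for p in seq:
        if isinstance(x, float) and math.isinf(x):
            degree = len(p) - 1
            values.append(p[0] if x > 0 or degree % 2 == 0 else -p[0])
        else:
            v = Fraction(0)
            for c in p:
                v = v * x + c          # Horner's scheme
            values.append(v)
    signs = [v > 0 for v in values if v != 0]   # zero elements are crossed out
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


seq = sturm_sequence([1, 0, 0, -4, 1])           # x^4 - 4x + 1
inf = float("inf")
print(sign_changes_at(seq, -inf), sign_changes_at(seq, 0), sign_changes_at(seq, inf))
```

The three printed counts correspond to $N(-\infty)$, N(0) and $N(+\infty)$ of the example, so their differences give $N_-$ and $N_+$.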


5.6 THE THEOREM OF BUDAN-FOURIER

Since the construction of a Sturm sequence generally involves
unwieldy computations, for practical purposes one usually confines
oneself to simpler particular techniques for counting the number
of real roots of algebraic equations.

Let us refine the counting of the number of variations of sign
in a sequence of numbers.


Definition. Suppose we have a finite ordered sequence of real numbers
$c_1, c_2, \ldots, c_n$ (1)

where $c_1 \ne 0$ and $c_n \ne 0$.

On the one hand, we use the term lower number of variations of
sign $\underline{N}$ of the sequence (1) for the number of sign-changes in an
appropriate subsequence that does not contain zero elements.

On the other hand, we use the term upper number of variations of
sign $\overline{N}$ of a sequence of numbers (1) for the number of sign-changes
in the transformed sequence (1) where the zero elements

$c_k = c_{k+1} = \ldots = c_{k+l-1} = 0$

($c_{k-1} \ne 0$, $c_{k+l} \ne 0$) are replaced by elements $\tilde{c}_{k+i}$
(i = 0, 1, 2, ..., l-1) such that

$\operatorname{sgn} \tilde{c}_{k+i} = (-1)^{l-i} \operatorname{sgn} c_{k+l}$ (2)

It is obvious that if (1) has no zero elements, then the number
N of sign-changes in the sequence coincides in meaning with its
lower $\underline{N}$ and upper $\overline{N}$ numbers of variations of sign:

$\underline{N} = N = \overline{N}$
Generally speaking, $\overline{N} \ge \underline{N}$.
Example 1. Determine the lower number and the upper number of
changes of sign in the sequence
1, 0, 0, -3, 1
Solution. Ignoring zeros, we have
$\underline{N} = 2$
To count $\overline{N}$ by formula (2), form the sequence
$1, -\varepsilon, \varepsilon, -3, 1$
where $\varepsilon > 0$, whence
$\overline{N} = 4$
Theorem (Budan-Fourier). If the numbers a and b (a < b) are
not roots of a polynomial P(x) of degree n, then the number N(a, b)
of real roots of the equation
$P(x) = 0$ (3)
lying between a and b is equal to the minimal number $\Delta N$ of
sign-changes lost in the sequence of successive derivatives
$P(x), P'(x), \ldots, P^{(n-1)}(x), P^{(n)}(x)$ (4)

when going from x = a to x = b, or less than $\Delta N$ by an even
number:

$N(a, b) = \Delta N - 2k$
where

$\Delta N = \underline{N}(a) - \overline{N}(b)$

and $\underline{N}(a)$ is the lower number of variations of sign in the
sequence (4) for x = a, $\overline{N}(b)$ is the upper number of variations of
sign in that sequence for x = b [$k = 0, 1, \ldots, E\left(\frac{\Delta N}{2}\right)$] (see [

It is assumed here that each root of equation (3) is counted
according to its multiplicity. If the derivatives $P^{(k)}(x)$ (k = 1, 2, ..., n)
do not vanish at x = a and x = b, then counting the signs is
simplified, namely:

$\Delta N = N(a) - N(b)$
Corollary 1. If $\Delta N = 0$, then there are no real roots of equation
(3) between a and b.

Corollary 2. If $\Delta N = 1$, then there is exactly one real root of
equation (3) between a and b.


Note. To count the number of lost signs $\Delta N$ in sequence (4), we
form two expansions using Horner's scheme:
$P(a + h) = \alpha_0 + \alpha_1 h + \alpha_2 h^2 + \ldots + \alpha_n h^n$ (5)
and

$P(b + h) = \beta_0 + \beta_1 h + \beta_2 h^2 + \ldots + \beta_n h^n$ (6)

Let $\underline{N}(a)$ be the lower number of variations of sign of the
coefficients in expansion (5) and, respectively, $\overline{N}(b)$ the upper number
of variations of sign of the coefficients in expansion (6). Since

$\alpha_k = \frac{P^{(k)}(a)}{k!}, \quad \beta_k = \frac{P^{(k)}(b)}{k!}$ (k = 0, 1, 2, ..., n)

it follows that the signs of the numbers $\alpha_k$ and $\beta_k$ coincide with
the signs of sequence (4) when x = a and x = b. Therefore

$\Delta N = \underline{N}(a) - \overline{N}(b)$
Example 2. Determine the number of real roots of the equation
$P(x) = x^3 - x^2 + 2x - 3 = 0$ (7)

in the interval (0, 2).

Solution. Here N(0) is clearly the number of variations of sign in
the sequence of numbers

-3, 2, -1, 1
that is,
N(0) = 3

The expansion of P(2 + h) is obtained by means of Horner's
scheme:

1   -1    2   -3  | 2
     2    2    8
1    1    4   [5]
     2    6
1    3  [10]
     2
1   [5]
[1]

Hence, N(2) is the number of variations of sign in the sequence
of numbers
5, 10, 5, 1

or N(2) = 0.
From this
$\Delta N = N(0) - N(2) = 3$
Thus, equation (7) has three real roots or one real root in the
interval (0, 2).
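
The count $\Delta N$ is easy to automate. In the Python sketch below (an illustration with hypothetical function names), `taylor_shift` repeats Horner's scheme to obtain the coefficients of P(a + h) in ascending powers of h, and the upper number of variations fills each zero with the sign opposite to that of its right-hand neighbour, which is what formula (2) prescribes. The last element of each sequence is assumed to be nonzero.

```python
def taylor_shift(coeffs, a):
    """Coefficients alpha_0, ..., alpha_n of P(a + h) by repeated Horner division."""
    work = list(coeffs)                 # highest power first
    out = []
    while work:
        for i in range(1, len(work)):
            work[i] += work[i - 1] * a
        out.append(work.pop())          # the remainder is the next coefficient
    return out


def lower_variations(seq):
    signs = [x > 0 for x in seq if x != 0]          # zeros are ignored
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def upper_variations(seq):
    signs = [0 if x == 0 else (1 if x > 0 else -1) for x in seq]
    for i in range(len(signs) - 2, -1, -1):          # fill zeros from right to left
        if signs[i] == 0:
            signs[i] = -signs[i + 1]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def budan_fourier_delta(coeffs, a, b):
    """Delta N = lower variations at a minus upper variations at b."""
    return lower_variations(taylor_shift(coeffs, a)) - upper_variations(taylor_shift(coeffs, b))


print(taylor_shift([1, -1, 2, -3], 2))               # coefficients of P(2 + h)
print(budan_fourier_delta([1, -1, 2, -3], 0, 2))     # Delta N on (0, 2)
```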
Descartes' rule of signs. The number of positive roots of an
algebraic equation
$P(x) = a_0 x^n + a_1 x^{n-1} + \ldots + a_n = 0$ ($a_0 \ne 0$) (8)
a root of multiplicity m being counted as m roots, is equal to the
number of variations in sign in the sequence of coefficients
$a_0, a_1, a_2, \ldots, a_n$ (9)

(where the coefficients equal to zero are not counted) or less than
that number by an even integer.

Descartes' rule is an instance of the application of the
Budan-Fourier theorem to the interval $(0, +\infty)$. Indeed, since

$P^{(k)}(0) = k! \, a_{n-k}$ (k = 0, 1, ..., n)

sequence (9) is, to within positive factors, a collection of
derivatives $P^{(k)}(0)$ (k = 0, 1, 2, ..., n) written in descending order.
Therefore, the number of variations in sign in the sequence (9) is equal
to $\underline{N}(0)$, zero coefficients not being counted. On the other hand,
the derivatives $P^{(k)}(+\infty)$ (k = 0, 1, 2, ..., n) clearly have one
and the same sign and, hence, $\overline{N}(+\infty) = 0$. We therefore have
$\Delta N = \underline{N}(0) - \overline{N}(+\infty) = \underline{N}(0)$

and, on the basis of the Budan-Fourier theorem, the number of
positive roots of (8) is either equal to $\Delta N$ or is less than $\Delta N$ by
an even integer.

Corollary. If the coefficients of equation (8) are different from
zero, then the number of negative roots of (8) (counting multiplicities) is
equal to the number of nonvariations of sign in the sequence (9)
of its coefficients or is less than that number by an even integer.
The proof of this assertion follows directly from the application
of Descartes' rule to the polynomial P(-x).

Also, let us give a necessary criterion for the real nature of all 
roots of a polynomial. 


Hua's Theorem. If an equation
$a_0 x^n + a_1 x^{n-1} + a_2 x^{n-2} + \ldots + a_n = 0$ (10)
has real coefficients and all its roots are real, then the square of
each nonextreme coefficient of the equation is greater than the
product of two adjacent coefficients, that is, we have the inequalities

$a_k^2 > a_{k-1} a_{k+1}$ (k = 1, 2, ..., n-1)
Corollary. If for some k we have the inequality
$a_k^2 \le a_{k-1} a_{k+1}$
then the equation (10) has at least one pair of complex roots.
Example 3. Determine the composition of the roots of the equation
$x^4 + 8x^3 - 12x^2 + 104x - 20 = 0$ (11)

Solution. Since
$(-12)^2 < 8 \cdot 104$

it follows that equation (11) has complex roots and, hence, the
number of real roots of the equation does not exceed two. In the
sequence of coefficients of equation (11) there are $\Delta N = 3$ variations
of sign and $\Delta P = 1$ nonvariation of sign. We thus conclude, on
the basis of Descartes' rule and the corollary to it and taking into
account the presence of complex roots, that equation (11) has one
positive root, one negative root and a pair of complex roots.
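
The two counts used in Example 3 can be reproduced with a few lines of Python. This is a small illustrative sketch with hypothetical function names; the count of nonvariations is meaningful only when no coefficient is zero, as the corollary to Descartes' rule requires.

```python
def descartes_counts(coeffs):
    """Return (variations, nonvariations) of sign among the nonzero coefficients."""
    signs = [c > 0 for c in coeffs if c != 0]
    variations = sum(1 for s, t in zip(signs, signs[1:]) if s != t)
    return variations, len(signs) - 1 - variations


def has_complex_roots_by_criterion(coeffs):
    """True if a_k^2 <= a_{k-1} a_{k+1} for some nonextreme coefficient a_k."""
    return any(
        coeffs[k] ** 2 <= coeffs[k - 1] * coeffs[k + 1]
        for k in range(1, len(coeffs) - 1)
    )


equation_11 = [1, 8, -12, 104, -20]     # x^4 + 8x^3 - 12x^2 + 104x - 20
print(descartes_counts(equation_11))
print(has_complex_roots_by_criterion(equation_11))
```

Note that the criterion is only a sufficient test for complex roots: a result of False does not prove that all roots are real.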


5.7 THE UNDERLYING PRINCIPLE OF THE METHOD 
OF LOBACHEVSKY-GRAEFFE 
Consider the nth degree algebraic equation
$a_0 x^n + a_1 x^{n-1} + \ldots + a_n = 0$ (1)

where $a_0 \ne 0$. Suppose that the roots $x_1, x_2, \ldots, x_n$ of equation (1)
are such that
$|x_1| \gg |x_2| \gg |x_3| \gg \ldots \gg |x_n|$ (2)

That is, the roots are distinct in modulus and the modulus of any
one is much more than that of the following one.¹⁾ In other words,
we assume that the ratio of any two successive roots (in descending
order) is a quantity small in modulus, i.e.

$x_2 = \varepsilon_1 x_1,$
$x_3 = \varepsilon_2 x_2,$
$\ldots$
$x_n = \varepsilon_{n-1} x_{n-1}$ (3)

where $|\varepsilon_k| < \varepsilon$ and $\varepsilon$ is a small quantity. For the sake of
brevity we will call them separated roots (Fig. 44).

Fig. 44

Now let us take advantage of the
relations between the roots and coefficients of equation (1) (Sec. 5.1):

$x_1 + x_2 + \ldots + x_n = -\frac{a_1}{a_0},$

$x_1 x_2 + x_1 x_3 + \ldots + x_{n-1} x_n = \frac{a_2}{a_0},$

$\ldots$

$x_1 x_2 \ldots x_n = (-1)^n \frac{a_n}{a_0}$

From this, by virtue of the assumptions (3), we get

$x_1 (1 + E_1) = -\frac{a_1}{a_0},$
$x_1 x_2 (1 + E_2) = \frac{a_2}{a_0},$
$\ldots$
$x_1 x_2 \ldots x_n (1 + E_n) = (-1)^n \frac{a_n}{a_0}$ (4)

where $E_1, E_2, \ldots, E_n$ are quantities small in modulus compared
to unity. Neglecting the quantities $E_k$ (k = 1, 2, ..., n) in (4), we
get the approximate relations

$x_1 = -\frac{a_1}{a_0},$
$x_1 x_2 = \frac{a_2}{a_0},$
$\ldots$
$x_1 x_2 \ldots x_n = (-1)^n \frac{a_n}{a_0}$ (5)
Whence we find the desired roots:
$x_1 = -\frac{a_1}{a_0},$
$x_2 = -\frac{a_2}{a_1},$
$\ldots$
$x_n = -\frac{a_n}{a_{n-1}}$ (6)

In other words, if the roots of equation (1) are separated, then they
are determined approximately by the chain of linear equations
$a_0 x_1 + a_1 = 0,$
$a_1 x_2 + a_2 = 0,$
$\ldots$
$a_{n-1} x_n + a_n = 0$

The accuracy of these roots depends on how small (in modulus)
are the quantities $\varepsilon_k$ in the relations (3).

¹⁾ If the coefficients of (1) are real, then from condition (2) it follows that
all the roots of (1) are real.

To separate the roots, use (1) to obtain the transformed equation


$a_0^{(m)} y^n + a_1^{(m)} y^{n-1} + \ldots + a_n^{(m)} = 0$ (7)

whose roots $y_1, y_2, \ldots, y_n$ are the mth powers of the roots
$x_1, x_2, \ldots, x_n$ of equation (1), that is

$y_k = x_k^m$ (k = 1, 2, ..., n) (8)

If the roots of equation (1), which we assume to be arranged in
descending order of moduli, are distinct in modulus, the roots of
equation (7) will be separated if m is a sufficiently high power,
since

$\frac{y_k}{y_{k-1}} = \left(\frac{x_k}{x_{k-1}}\right)^m \to 0$ as $m \to \infty$

For example, let
$x_1 = 2, \quad x_2 = 1.5, \quad x_3 = 1$

For m = 100 we have
$y_1 = 1.27 \cdot 10^{30}, \quad y_2 = 4.06 \cdot 10^{17}, \quad y_3 = 1$

and, hence,

$\frac{y_2}{y_1} = 3.2 \cdot 10^{-13}, \quad \frac{y_3}{y_2} = 2.5 \cdot 10^{-18}$

Ordinarily, for the exponent m one takes a power of the number 2,
that is, we put $m = 2^p$, where p is a natural number; the
transformation itself is executed in p steps, an equation being formed
at each step (the roots of the equation are the squares of the roots
of the preceding equation).

Approximating the roots y,(k=1, 2, ..., n) from formulas (8), 
we can determine the roots of the original equation (1). The accu- 
racy of the computations depends on the smallness of the ratio of 
moduli of successive roots in the transformed equation. 

The idea of this method for computing roots was suggested by 
Lobachevsky, and a practically convenient scheme of computation 
was advanced by Graeffe. 

The advantage of the Lobachevsky-Graeffe method lies in the
fact that it does not require the roots to be isolated. It is only
necessary to get rid of multiple roots by the device given in Sec. 5.1.
The actual computation of the roots is uniform and regular. As we
will soon see, the method is also suitable for finding complex roots.
One inconvenience of the method is that it involves large numbers.
Another is the absence of a sufficiently reliable check on the
computations, and there are difficulties in estimating the accuracy of
the result obtained.

Note that if the roots of equation (1) are distinct but the mo- 
duli of some of them are nearly equal, the convergence of the Lo- 
bachevsky-Graeffe method is extremely slow. In this case, it is 
advisable to regard such roots as equal in modulus and to apply 
special computational techniques. 


5.8 THE ROOT-SQUARING PROCESS


We will now show how one can easily set up an equation whose 
roots are the squares (taken with the minus sign) of the roots of 
the given algebraic equation. This latter procedure is used for 
reasons of convenience in order to avoid, as far as possible, the 
appearance of negative coefficients. The transition from roots
$x_k$ (k = 1, 2, ..., n) to the roots

$y_k = -x_k^2$ (1)

will for brevity be called root squaring.
Let

$P(x) = a_0 x^n + a_1 x^{n-1} + \ldots + a_n = 0$
be the given equation, where $a_0 \ne 0$.

Denoting the roots of this equation by $x_1, x_2, \ldots, x_n$ we have

$P(x) = a_0 (x - x_1)(x - x_2) \ldots (x - x_n)$
whence

$P(-x) = (-1)^n a_0 (x + x_1)(x + x_2) \ldots (x + x_n)$
Consequently
$P(x) P(-x) = (-1)^n a_0^2 (x^2 - x_1^2)(x^2 - x_2^2) \ldots (x^2 - x_n^2)$ (2)
Setting
$y = -x^2$

we get, by formula (2), the polynomial

$Q(y) = P(x) P(-x)$
whose roots are the numbers

$y_k = -x_k^2$ (k = 1, 2, ..., n)
Since
$P(-x) = (-1)^n [a_0 x^n - a_1 x^{n-1} + a_2 x^{n-2} - \ldots + (-1)^n a_n]$

we have, after multiplying out the polynomials P(x) and P(-x),
$P(x) P(-x) = (-1)^n [a_0^2 x^{2n} - (a_1^2 - 2a_0 a_2) x^{2n-2} +$

$+ (a_2^2 - 2a_1 a_3 + 2a_0 a_4) x^{2n-4} - \ldots + (-1)^n a_n^2]$
Hence, the equation we are interested in is

$Q(y) = A_0 y^n + A_1 y^{n-1} + A_2 y^{n-2} + \ldots + A_n = 0$
where
$A_0 = a_0^2,$
$A_1 = a_1^2 - 2a_0 a_2,$
$A_2 = a_2^2 - 2a_1 a_3 + 2a_0 a_4,$
$\ldots$
$A_n = a_n^2$

We write more compactly
$A_k = a_k^2 + 2 \sum_{s=1}^{k} (-1)^s a_{k-s} a_{k+s}$ (k = 0, 1, 2, ..., n)

where it is assumed that $a_s = 0$ for s < 0 and s > n.


Rule. In root squaring, each coefficient of the transformed equa- 
tion is equal to the square of the earlier coefficient minus twice the 
product of the adjacent coefficients plus twice the product of the 
next two coefficients, etc. If the required coefficient is absent, it is
considered equal to zero.
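
The rule amounts to one short function. The Python sketch below is an illustration (the name `root_squaring_step` is hypothetical); coefficients are listed highest power first, and coefficients outside the range 0..n are taken as zero.

```python
def root_squaring_step(a):
    """Coefficients A_k of the equation whose roots are y_k = -x_k^2."""
    n = len(a) - 1

    def coef(i):
        return a[i] if 0 <= i <= n else 0    # absent coefficients count as zero

    return [
        coef(k) ** 2
        + 2 * sum((-1) ** s * coef(k - s) * coef(k + s) for s in range(1, k + 1))
        for k in range(n + 1)
    ]


# x^3 - 3x + 1 = 0: first and second root squaring
step1 = root_squaring_step([1, 0, -3, 1])
step2 = root_squaring_step(step1)
print(step1, step2)
```

For the cubic $x^3 - 3x + 1$ the two steps give the coefficients 1, 6, 9, 1 and 1, 18, 69, 1.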


5.9 THE LOBACHEVSKY-GRAEFFE METHOD FOR THE CASE
OF REAL AND DISTINCT ROOTS
Suppose the roots $x_1, x_2, \ldots, x_n$ of an nth degree equation with
real coefficients
$a_0 x^n + a_1 x^{n-1} + \ldots + a_n = 0$ (1)

are real and unequal in modulus. Arrange them in order of
decreasing moduli:
$|x_1| > |x_2| > \ldots > |x_n|$

Repeatedly applying the root-squaring process, we form the equation

$b_0 y^n + b_1 y^{n-1} + \ldots + b_n = 0$ (2)

whose roots are the numbers
$y_k = -x_k^{2^p}$ (k = 1, 2, ..., n) (3)
If p is sufficiently great, the roots $y_1, y_2, \ldots, y_n$ are separated
and, on the basis of the results of Sec. 5.7, can be determined
from the chain of linear equations

$b_0 y_1 + b_1 = 0,$
$b_1 y_2 + b_2 = 0,$
$\ldots$
$b_{n-1} y_n + b_n = 0$
From this we get
$x_k = \pm \sqrt[2^p]{-y_k} = \pm \sqrt[2^p]{\frac{b_k}{b_{k-1}}}$ (k = 1, 2, ..., n) (4)

The signs of the roots x, are determined by a rough guess, by 
substitution into the given equation, or on the basis of the rela- 
tions between the roots and the coefficients of the equations. The 
process of root squaring usually continues until the doubled pro- 
ducts cease to affect the first main terms of the coefficients of the 
transformed equation. 


Rule. The process of root squaring is terminated if the coefficients
of some transformed equation are equal, to within the accuracy of
the computations, to the squares of the corresponding coefficients of
the preceding transformed equation due to the absence of doubled
products.

Indeed, if the transformed equation corresponding to the power
$2^{p+1}$ has the form

$c_0 z^n + c_1 z^{n-1} + \ldots + c_n = 0$
and the relations
$c_k = b_k^2$ (k = 0, 1, 2, ..., n)

hold, then we clearly get

$|x_k| = \sqrt[2^{p+1}]{\frac{c_k}{c_{k-1}}} = \sqrt[2^p]{\frac{b_k}{b_{k-1}}}$

Thus, under the circumstances we cannot improve the accuracy
of the root computations.

Since, when applying the Lobachevsky-Graeffe method, the
coefficients of the transformed equations generally grow rapidly, it is
useful to isolate their orders by writing the coefficients in
powers-of-ten notation $\alpha \cdot 10^m$, where $|\alpha| < 10$ and m is an integer. It is
advisable to use logarithms in computations requiring extreme
accuracy (see [5]).


Example. Use the Lobachevsky-Graeffe method to find the roots
of the equation

$x^3 - 3x + 1 = 0$ (5)

Solution. The results of the computations carried to four
significant digits are tabulated in Table 8.
TABLE 8
COMPUTATION OF REAL ROOTS BY THE LOBACHEVSKY-GRAEFFE METHOD
(each entry: square of the preceding coefficient, doubled product, sum)

Power | x^3 | x^2                                     | x^1                                  | x^0
1     | 1   | 0                                       | -3                                   | 1
2     | 1   | 0 + 6 = 6                               | 9 + 0 = 9                            | 1
4     | 1   | 36 - 18 = 18                            | 81 - 12 = 69                         | 1
8     | 1   | 3.24·10^2 - 1.38·10^2 = 1.86·10^2       | 4.761·10^3 - 0.036·10^3 = 4.725·10^3 | 1
16    | 1   | 3.460·10^4 - 0.945·10^4 = 2.515·10^4    | 2.233·10^7 + 0 = 2.233·10^7          | 1
32    | 1   | 6.325·10^8 - 0.447·10^8 = 5.878·10^8    | 4.986·10^14 + 0 = 4.986·10^14        | 1
64    | 1   | 3.455·10^17 - 0.010·10^17 = 3.445·10^17 | 2.486·10^29 + 0 = 2.486·10^29        | 1
128   | 1   | 1.187·10^35 + 0 = 1.187·10^35           | 6.180·10^58 + 0 = 6.180·10^58        | 1



Stopping with the 64th power of the roots, we have

$-x_1^{64} + 3.445 \cdot 10^{17} = 0,$
$-3.445 \cdot 10^{17} \cdot x_2^{64} + 2.486 \cdot 10^{29} = 0,$
$-2.486 \cdot 10^{29} \cdot x_3^{64} + 1 = 0$
whence
$x_1 = \pm \sqrt[64]{3.445 \cdot 10^{17}},$

$x_2 = \pm \sqrt[64]{\frac{2.486}{3.445} \cdot 10^{12}},$

$x_3 = \pm \sqrt[64]{\frac{1}{2.486} \cdot 10^{-29}}$

Taking logarithms we obtain
$\log_{10} |x_1| = \frac{1}{64} \cdot 17.53719 = 0.27402,$
$\log_{10} |x_2| = \frac{1}{64} \cdot 11.85831 = 0.18528,$
$\log_{10} |x_3| = \frac{1}{64} \cdot (-29.39550) = \bar{1}.54070$

and, consequently,

$x_1 = \pm 1.879,$
$x_2 = \pm 1.532,$
$x_3 = \pm 0.347$

In determining the signs of the roots, note that by Descartes'
rule, equation (5) has one negative root and two positive roots,¹⁾
and

$x_1 + x_2 + x_3 = 0$ (6)
Therefore, the negative root must be the largest in modulus and
we finally get
$x_1 = -1.879,$
$x_2 = 1.532,$
$x_3 = 0.347$
Relation (6) holds true within the specified accuracy. By way of
comparison, we give the exact values of the roots obtained by
Cardan's formula:
$x_1 = 2 \cos 160° = -1.87938,$
$x_2 = 2 \cos 40° = 1.53208,$
$x_3 = 2 \cos 80° = 0.34730$

¹⁾ We take into account the fact that the equation $P(x) = x^3 - 3x + 1 = 0$
has positive roots since P(0) > 0 and P(1) < 0.

It will be noted that in our case the computation of roots was
somewhat simplified because the extreme coefficients of the equation
were equal to unity. Generally speaking, when using the
Lobachevsky-Graeffe method it is advisable first to transform the
equation so that the leading coefficient is equal to unity and the
constant term is equal to ±1 (see [5]).


5.10 THE LOBACHEVSKY-GRAEFFE METHOD FOR THE CASE
OF COMPLEX ROOTS
Let us now generalize the concept of separation of roots.
Suppose the roots $x_1, x_2, \ldots, x_n$ of the equation
$a_0 x^n + a_1 x^{n-1} + \ldots + a_n = 0$ (1)

satisfy the conditions
$|x_1| \ge |x_2| \ge \ldots \ge |x_m| \gg |x_{m+1}| \ge |x_{m+2}| \ge \ldots \ge |x_n|$ (2)

In other words, it is assumed that the roots of equation (1) can
be divided into two categories (groups):
$x_1, x_2, \ldots, x_m$ (m < n)
and

$x_{m+1}, x_{m+2}, \ldots, x_n$

so that the moduli of the roots of the first category are very great
compared with those of the second category (see Fig. 45, where
the roots lie in the cross-lined regions, while the interior of the
nonlined annulus is free of roots and constitutes a "desert area").

Fig. 45

Write down the first m relations between the roots and the 
coefficients of (1): : 


Rte. thm tnt + ae ae 


ap 





iS a 
XXa FXX F 22. EX Xi tb (mnn t oe +e, 1%) =F 


Neglecting the terms with relatively small moduli (they are in
brackets), we get the approximate relations
$$x_1 + x_2 + \ldots + x_m = -\frac{a_1}{a_0},$$
$$x_1x_2 + x_1x_3 + \ldots + x_{m-1}x_m = \frac{a_2}{a_0},$$
$$\ldots \qquad (3)$$
$$x_1x_2\ldots x_m = (-1)^m\frac{a_m}{a_0}$$


From this it follows that the roots $x_1, x_2, \ldots, x_m$ of the first
category (with large moduli) are approximately the roots of the
equation
$$a_0x^m + a_1x^{m-1} + \ldots + a_m = 0 \qquad (4)$$


From the remaining unused $n-m$ relations between the roots and
coefficients of equation (1), we get
$$x_1x_2\ldots x_m(x_{m+1} + x_{m+2} + \ldots + x_n) + \ldots = (-1)^{m+1}\frac{a_{m+1}}{a_0},$$
$$x_1x_2\ldots x_m(x_{m+1}x_{m+2} + \ldots + x_{n-1}x_n) + \ldots = (-1)^{m+2}\frac{a_{m+2}}{a_0},$$
$$\ldots$$
$$x_1x_2\ldots x_m x_{m+1}\ldots x_n = (-1)^n\frac{a_n}{a_0}$$


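Dropping the relatively small (in modulus) terms, we get the
approximate relations
$$x_1x_2\ldots x_m(x_{m+1} + x_{m+2} + \ldots + x_n) = (-1)^{m+1}\frac{a_{m+1}}{a_0},$$
$$x_1x_2\ldots x_m(x_{m+1}x_{m+2} + \ldots + x_{n-1}x_n) = (-1)^{m+2}\frac{a_{m+2}}{a_0},$$
$$\ldots$$
$$x_1x_2\ldots x_m x_{m+1}\ldots x_n = (-1)^n\frac{a_n}{a_0}$$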


Whence, using the last relation in formulas (3), we find
$$x_{m+1} + x_{m+2} + \ldots + x_n = -\frac{a_{m+1}}{a_m},$$
$$x_{m+1}x_{m+2} + \ldots + x_{n-1}x_n = \frac{a_{m+2}}{a_m},$$
$$\ldots \qquad (5)$$
$$x_{m+1}x_{m+2}\ldots x_n = (-1)^{n-m}\frac{a_n}{a_m}$$

Consequently, the roots $x_{m+1}, x_{m+2}, \ldots, x_n$ of the second category
(with small moduli) are approximately the roots of the equation
$$a_mx^{n-m} + a_{m+1}x^{n-m-1} + \ldots + a_n = 0 \qquad (6)$$


Thus, under our conditions, equation (1) decomposes into two
equations of lower degree, (4) and (6), each one of which approximately
determines the roots belonging to one of the categories.
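The decomposition into equations (4) and (6) can be tried numerically. The following is a minimal sketch, not part of the original text: the test polynomial (with roots 2000, 1000, 2 and 1, so that $m = 2$) and the helper `quadratic_roots` are assumptions chosen purely for illustration.

```python
import cmath


def quadratic_roots(a, b, c):
    """Both roots of a*x^2 + b*x + c = 0."""
    disc = cmath.sqrt(b * b - 4 * a * c)
    return (-b + disc) / (2 * a), (-b - disc) / (2 * a)


# Hypothetical test equation: (x^2 - 3000x + 2000000)(x^2 - 3x + 2) = 0,
# i.e. roots 2000, 1000 (first category) and 2, 1 (second category).
coeffs = [1, -3003, 2009002, -6006000, 4000000]
m = 2

# Equation (4): a0*x^m + ... + am = 0 approximates the large roots.
large = quadratic_roots(*coeffs[:m + 1])
# Equation (6): am*x^(n-m) + ... + an = 0 approximates the small roots.
small = quadratic_roots(*coeffs[m:])

print([round(z.real, 2) for z in large])
print([round(z.real, 4) for z in small])
```

With the moduli of the two groups differing by a factor of about 500, each truncated equation reproduces its roots to within roughly one percent; the root-squaring process exists precisely to make this gap as wide as desired.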

Arguing by analogy, we conclude that if the roots of (1) can be
split into $p$ categories
$$x_1, x_2, \ldots, x_{m_1};$$
$$x_{m_1+1}, x_{m_1+2}, \ldots, x_{m_1+m_2};$$
$$\ldots$$
$$x_{m_1+\ldots+m_{p-1}+1}, x_{m_1+\ldots+m_{p-1}+2}, \ldots, x_{m_1+\ldots+m_p}$$
$$(m_1 + m_2 + \ldots + m_p = n)$$
so that the condition
$$|x_1| \ge \ldots \ge |x_{m_1}| \gg |x_{m_1+1}| \ge \ldots \ge |x_{m_1+m_2}| \gg \ldots \gg |x_{m_1+\ldots+m_{p-1}+1}| \ge \ldots \ge |x_n|$$
holds, that is, the moduli of the roots belonging to lower categories
considerably exceed the moduli of roots of higher categories
(we may say that these roots are separated in the group sense),
then the roots of each category can be determined in approximate
fashion from the corresponding equations
$$a_0x^{m_1} + a_1x^{m_1-1} + \ldots + a_{m_1} = 0,$$
$$a_{m_1}x^{m_2} + a_{m_1+1}x^{m_2-1} + \ldots + a_{m_1+m_2} = 0,$$
$$\ldots \qquad (7)$$
$$a_{m_1+\ldots+m_{p-1}}x^{m_p} + a_{m_1+\ldots+m_{p-1}+1}x^{m_p-1} + \ldots + a_{m_1+\ldots+m_p} = 0$$
the degrees of which are $m_1, m_2, \ldots, m_p$, respectively. In
particular, if the roots of (1) are completely separated, then equations
(7) are linear equations; a pair of complex roots, in the absence
of other roots of the same modulus, will be associated with
a quadratic equation in (7).

We consider here only the simplest cases when equation (1),
whose coefficients are considered to be real, has one pair of complex
roots or two pairs of complex roots with distinct moduli, the
moduli of the real roots being distinct and different from the
moduli of the complex roots. More general cases are discussed in
Krylov [5] and Scarborough [6].


5.11 THE CASE OF A PAIR OF COMPLEX ROOTS

Suppose
$$x_m = u + iv,\quad x_{m+1} = u - iv \qquad (1)$$
($u$ and $v$ real, $v \ne 0$) are complex roots of equation (1) of
Sec. 5.10, and all the other roots $x_k$ ($k \ne m$, $k \ne m+1$) of this
equation are real and satisfy the condition
$$|x_1| > |x_2| > \ldots > |x_{m-1}| > |x_m| = |x_{m+1}| > |x_{m+2}| > \ldots > |x_n| \qquad (2)$$
Applying the root-squaring process, form the equation
$$b_0y^n + b_1y^{n-1} + \ldots + b_n = 0$$
whose roots are
$$y_k = -x_k^{2^p} \quad (k = 1, 2, \ldots, n)$$


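Given a sufficiently large natural $p$, the real roots $y_1, \ldots, y_{m-1}, y_{m+2}, \ldots, y_n$
will be separated with a high degree of accuracy
and may be determined from the linear equations
$$b_0y_1 + b_1 = 0,$$
$$\ldots$$
$$b_{m-2}y_{m-1} + b_{m-1} = 0,$$
$$b_{m+1}y_{m+2} + b_{m+2} = 0,$$
$$\ldots$$
$$b_{n-1}y_n + b_n = 0$$
whence we get
$$x_k = \pm\sqrt[2^p]{\frac{b_k}{b_{k-1}}} \quad (k \ne m,\ k \ne m+1)$$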


The root-squaring process is terminated when the doubled
products in the coefficients $b_1, \ldots, b_{m-1}, b_{m+1}, \ldots, b_n$ vanish in the
next step (to within the specified accuracy). As for the coefficient
$b_m$, it does not, generally speaking, include vanishing doubled
products. It may even happen that these products dominate the
square and the coefficient $b_m$ has a changing sign. This is a
characteristic sign of the presence of complex roots or roots with equal
moduli in equation (1) of Sec. 5.10, and the unusual behaviour of
the coefficient $b_m$ indicates the position of such roots in the
sequence (2) of moduli.

Note that the equation under consideration definitely has
complex roots if the coefficient $b_m$ changes sign; indeed, in the case of
real roots only, all the coefficients of the transformed equations
will clearly be nonnegative.

According to this general theory, the roots $y_m$ and $y_{m+1}$, which
correspond to the complex roots $x_m$ and $x_{m+1}$, approximately
satisfy the quadratic equation
$$b_{m-1}y^2 + b_my + b_{m+1} = 0$$
Note that the coefficient $b_m$ is the middle one. Since
$$x_mx_{m+1} = u^2 + v^2 = r^2$$
where
$$r = |x_m| = |x_{m+1}|$$
is the common modulus of the complex roots, and
$$y_my_{m+1} = (x_mx_{m+1})^{2^p}$$
then, by the property of the roots of a quadratic equation, we
have
$$(r^2)^{2^p} = \frac{b_{m+1}}{b_{m-1}}$$
And from this we determine the square of the modulus of the
complex roots:
$$r^2 = \sqrt[2^p]{\frac{b_{m+1}}{b_{m-1}}} \qquad (3)$$

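The easiest way to find the real part $u$ of the complex roots is
to use the relation
$$x_1 + \ldots + x_{m-1} + (x_m + x_{m+1}) + x_{m+2} + \ldots + x_n = -\frac{a_1}{a_0}$$
whence
$$2u = x_m + x_{m+1} = -\frac{a_1}{a_0} - \sum_{k \ne m,\ k \ne m+1} x_k$$
and, consequently,
$$u = -\frac{1}{2}\left(\frac{a_1}{a_0} + \sum_{k \ne m,\ k \ne m+1} x_k\right) \qquad (4)$$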


Knowing, by virtue of formula (3), the common modulus $r$ of the
complex roots, we find the coefficient $v$ of their imaginary part:
$$v = \sqrt{r^2 - u^2} \qquad (5)$$
Using formulas (4) and (5), determine the desired complex roots:
$$x_{m,\,m+1} = u \pm iv$$

It is also possible to seek complex roots in trigonometric form:
$$x_{m,\,m+1} = r(\cos\varphi \pm i\sin\varphi)$$


Example. Find the roots of the equation [7]
$$x^4 + x^3 - 10x^2 - 34x - 26 = 0 \qquad (6)$$


Solution. The results of the computations, to four significant
digits, are given in Table 9. Under each row of coefficients are written
the squares and the doubled products from which the next row is formed.

TABLE 9

COMPUTING COMPLEX ROOTS BY THE LOBACHEVSKY-GRAEFFE METHOD

Power   x^4   x^3             x^2              x^1              x^0
-----   ---   -------------   --------------   --------------   -------------
  1     1     1               -10              -34              -26
              1               100              1156
              20              68               -520
                              -52
  2     1     21              116              636              676
              441             1.346·10^4       4.045·10^5
              -232            -2.671·10^4      -1.568·10^5
                              0.135·10^4
  4     1     209             -1.190·10^4      2.477·10^5       4.570·10^5
              4.368·10^4      1.416·10^8       6.135·10^10
              2.380·10^4      -1.035·10^8      1.088·10^10
                              0.009·10^8
  8     1     6.748·10^4      3.90·10^7        7.223·10^10      2.088·10^11
              4.554·10^9      1.521·10^15      5.216·10^21
              -0.078·10^9     -9.748·10^15     -0.016·10^21
                              0
 16     1     4.476·10^9      -8.227·10^15     5.200·10^21      4.360·10^22
              2.003·10^19     6.768·10^31      2.704·10^43
              0.002·10^19     -4.655·10^31     0
                              0
 32     1     2.005·10^19     2.113·10^31      2.704·10^43      1.901·10^45
              4.020·10^38     4.465·10^62      7.312·10^86
              0               -1.084·10^63     0
                              0
 64     1     4.020·10^38     -6.38·10^62      7.312·10^86      3.614·10^90






From Table 9 it is clear that in the fifth transformed equation
(with 32nd powers of the roots, $2^5 = 32$), the real roots $x_1$ and $x_4$
(in descending order of moduli) are separated. These roots can be
found from the two-term equations
$$-x_1^{32} + 2.005\cdot10^{19} = 0,$$
$$-2.704\cdot10^{43}x_4^{32} + 1.901\cdot10^{45} = 0$$


whence
$$x_1 = \pm\sqrt[32]{2.005\cdot10^{19}},\quad x_4 = \pm\sqrt[32]{\frac{1.901}{2.704}\cdot10^{2}}$$
Taking logarithms, we have
$$\log_{10}|x_1| = \frac{1}{32}\cdot19.30211 = 0.60319,$$
$$\log_{10}|x_4| = \frac{1}{32}(2.27898 - 0.43201) = 0.05772$$
Hence
$$x_1 = \pm4.010,\quad x_4 = \pm1.142$$
A rough check convinces us that the root $x_1$ is positive and the
root $x_4$ is negative. We thus finally get
$$x_1 = 4.010,\quad x_4 = -1.142$$
Since the transformed coefficient of $x^2$ changes sign, the given
equation has complex roots $x = x_2$ and $x = x_3$ which are found
from the three-term equation
$$2.005\cdot10^{19}y^2 + 2.113\cdot10^{31}y + 2.704\cdot10^{43} = 0$$
where
$$y = -x^{32}$$
By the general theory, the modulus of the roots
$$r = |x_2| = |x_3|$$
is found from formula (3):
$$r^2 = \sqrt[32]{\frac{2.704\cdot10^{43}}{2.005\cdot10^{19}}}$$
whence
$$\log_{10}r^2 = \frac{1}{32}(24.43201 - 0.30211) = 0.75406$$
and, therefore,
$$r^2 = 5.6763$$
Setting
$$x_2 = u + iv,\quad x_3 = u - iv$$
we get from
$$x_1 + x_2 + x_3 + x_4 = -1$$
the relation
$$u = \frac{1}{2}(-1 - 4.010 + 1.142) = -1.934$$
The coefficient $v$ of the imaginary part is found from the formula
$$v = \sqrt{r^2 - u^2} = \sqrt{5.6763 - 3.7404} = \sqrt{1.9359} = 1.391$$
Hence
$$x_{2,3} = -1.934 \pm 1.391i$$


Note that the roots $x_2$ and $x_3$ may also be found from the
relations between the roots and the coefficients of equation (6);
namely, we have
$$x_1 + x_2 + x_3 + x_4 = -1,$$
$$x_1x_2x_3x_4 = -26$$
and from this, using the values of $x_1$ and $x_4$ found above, we get
$$x_2 + x_3 = -3.869,$$
$$x_2x_3 = 5.677$$
And so $x_2$ and $x_3$ may be found as the roots of the quadratic
equation
$$x^2 + 3.869x + 5.677 = 0$$
whose solution yields
$$x_{2,3} = -1.934 \pm 1.391i$$
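The computation of this example can be repeated by machine. The sketch below is an illustration only and is not the authors' scheme: the helper names `graeffe_step`, `horner` and `signed_root` are assumptions, and exact integer arithmetic is used in place of the four-digit rounding of Table 9.

```python
def graeffe_step(b):
    """One root-squaring step. For b[0]*x^n + ... + b[n] = 0 it returns the
    coefficients of the equation whose roots are -x_k^2:
    c_k = b_k^2 - 2*b_(k-1)*b_(k+1) + 2*b_(k-2)*b_(k+2) - ..."""
    n = len(b) - 1
    c = []
    for k in range(n + 1):
        s = b[k] * b[k]
        sign = -1
        j = 1
        while k - j >= 0 and k + j <= n:
            s += sign * 2 * b[k - j] * b[k + j]
            sign = -sign
            j += 1
        c.append(s)
    return c


def horner(a, x):
    """Value of the polynomial with coefficients a at the point x."""
    value = 0.0
    for coeff in a:
        value = value * x + coeff
    return value


def signed_root(a, modulus):
    """Choose the sign of a real root by substituting +modulus and -modulus."""
    return min((modulus, -modulus), key=lambda t: abs(horner(a, t)))


a = [1, 1, -10, -34, -26]            # equation (6)
first = graeffe_step(a)              # row "Power 2" of Table 9
b = a
p = 5
for _ in range(p):
    b = graeffe_step(b)              # roots are now -x_k^32
power = 2 ** p

x1 = signed_root(a, (b[1] / b[0]) ** (1 / power))
x4 = signed_root(a, (b[4] / b[3]) ** (1 / power))
r2 = (b[3] / b[1]) ** (1 / power)    # formula (3), square of the modulus
u = 0.5 * (-a[1] / a[0] - x1 - x4)   # formula (4)
v = (r2 - u * u) ** 0.5              # formula (5)

print(first)
print(round(x1, 3), round(x4, 3), round(u, 3), round(v, 3))
```

The sign of each real root is settled here by direct substitution into equation (6), which plays the part of the "rough check" mentioned in the solution.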


5.12 THE CASE OF TWO PAIRS OF COMPLEX ROOTS 


Suppose equation (1) of Sec. 5.10 admits two pairs of complex
roots:
$$x_k = u_1 + iv_1,\quad x_{k+1} = u_1 - iv_1$$
and
$$x_m = u_2 + iv_2,\quad x_{m+1} = u_2 - iv_2$$
with distinct moduli ($u_1, v_1, u_2, v_2$ real and $v_1 \ne 0$, $v_2 \ne 0$); all
other roots $x_j$ ($j \ne k$, $j \ne k+1$, $j \ne m$, $j \ne m+1$) of this equation
are real, distinct in absolute value, nonzero (zero roots can be
isolated beforehand), and different from the complex roots in
modulus; that is,
$$|x_1| > |x_2| > \ldots > |x_{k-1}| > |x_k| = |x_{k+1}| > \ldots > |x_m| = |x_{m+1}| > \ldots > |x_n| > 0 \qquad (1)$$
As usual, performing the root-squaring process in the equation at
hand up to some power $2^p$, we get the transformed equation
$$b_0y^n + b_1y^{n-1} + \ldots + b_n = 0$$
whose roots are the numbers
$$y_j = -x_j^{2^p} \quad (j = 1, 2, \ldots, n)$$

For a sufficiently large $p$, it will be seen that in passing to the
power $2^{p+1}$ some of the coefficients $c_j$ of the newly transformed
equation
$$c_0z^n + c_1z^{n-1} + \ldots + c_n = 0$$
will be (to within the specified accuracy) squares of the
corresponding coefficients $b_j$ of the preceding transformed equation. Under
our assumption (1), we finally get
$$c_j = b_j^2 \quad\text{for } j = 0, 1, 2, \ldots, k-1, k+1, \ldots, m-1, m+1, \ldots, n$$
and
$$c_k \ne b_k^2 \quad\text{and}\quad c_m \ne b_m^2$$


This enables us to establish the position of the complex roots.
Note that a change in the signs of the coefficients $b_k$ and $b_m$ for
different exponents $2^p$ serves as a sufficient criterion of the
presence of two pairs of complex roots of equation (1) of Sec. 5.10.

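The real roots $x_j$ of the equation under consideration are
determined from the two-term equations
$$-b_{j-1}x_j^{2^p} + b_j = 0 \qquad (2)$$
whence
$$x_j = \pm\sqrt[2^p]{\frac{b_j}{b_{j-1}}} \quad (j \ne k,\ j \ne k+1,\ j \ne m,\ j \ne m+1)$$

The complex roots $x_k, x_{k+1}$ and $x_m, x_{m+1}$ are found respectively
from the three-term equations
$$b_{k-1}x^{2^{p+1}} - b_kx^{2^p} + b_{k+1} = 0 \qquad (2')$$
and
$$b_{m-1}x^{2^{p+1}} - b_mx^{2^p} + b_{m+1} = 0 \qquad (2'')$$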
Let us introduce the notation
$$r_1 = |x_k| = |x_{k+1}|$$
and
$$r_2 = |x_m| = |x_{m+1}|$$
Noting that
$$r_1^2 = x_kx_{k+1}$$
and
$$r_2^2 = x_mx_{m+1}$$
we can compute from equations (2') and (2'') the squares of the
moduli of the complex roots:
$$r_1^2 = \sqrt[2^p]{\frac{b_{k+1}}{b_{k-1}}} \quad\text{and}\quad r_2^2 = \sqrt[2^p]{\frac{b_{m+1}}{b_{m-1}}}$$


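To determine the real parts $u_1$ and $u_2$ of the complex roots, use
the relations between the roots and coefficients of equation (1) of
Sec. 5.10. We have
$$x_1x_2\ldots x_{n-1} + x_1x_2\ldots x_{n-2}x_n + \ldots + x_2x_3\ldots x_n = (-1)^{n-1}\frac{a_{n-1}}{a_0}$$
and
$$x_1x_2\ldots x_n = (-1)^n\frac{a_n}{a_0}$$
Dividing the first equation by the second, we get
$$\frac{1}{x_1} + \frac{1}{x_2} + \ldots + \frac{1}{x_n} = -\frac{a_{n-1}}{a_n}$$
Besides,
$$x_1 + x_2 + \ldots + x_n = -\frac{a_1}{a_0}$$
whence, taking into account the relations
$$x_k + x_{k+1} = 2u_1,\quad x_m + x_{m+1} = 2u_2$$
and
$$\frac{1}{x_k} + \frac{1}{x_{k+1}} = \frac{2u_1}{r_1^2},\quad \frac{1}{x_m} + \frac{1}{x_{m+1}} = \frac{2u_2}{r_2^2}$$
we have the following linear system of equations:
$$u_1 + u_2 = -\frac{a_1}{2a_0} - \frac{\sigma}{2},$$
$$\frac{u_1}{r_1^2} + \frac{u_2}{r_2^2} = -\frac{a_{n-1}}{2a_n} - \frac{\sigma'}{2} \qquad (3)$$
where $\sigma$ is the sum of the real roots and $\sigma'$ is the sum of the
reciprocals of the real roots:
$$\sigma = \sum_{j \ne k,\,k+1,\,m,\,m+1} x_j$$
and
$$\sigma' = \sum_{j \ne k,\,k+1,\,m,\,m+1} \frac{1}{x_j}$$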
Finding $u_1$ and $u_2$ from system (3), we determine the coefficients
$v_1$ and $v_2$ of the imaginary parts of the complex roots from the
formulas
$$v_1 = \sqrt{r_1^2 - u_1^2},\quad v_2 = \sqrt{r_2^2 - u_2^2}$$
We finally get
$$x_{k,\,k+1} = u_1 \pm iv_1$$
and
$$x_{m,\,m+1} = u_2 \pm iv_2$$


Example. Using the Lobachevsky-Graeffe method, solve the
equation [7]
$$x^4 + 4x^2 - 3x + 3 = 0 \qquad (4)$$

Solution. Applying the root-squaring process up to the 16th
power and carrying the result to four significant digits, we obtain
the results as given in Table 10.

TABLE 10

COMPUTING TWO PAIRS OF COMPLEX ROOTS
BY THE LOBACHEVSKY-GRAEFFE METHOD

Power   x^4   x^3             x^2            x^1             x^0
-----   ---   -------------   ------------   -------------   ----------
  1     1     0               4              -3              3
              0               16             9
              -8              0              -24
                              6
  2     1     -8              22             -15             9
              64              484            225
              -44             -240           -396
                              18
  4     1     20              262            -171            81
              4·10^2          6.864·10^4     2.924·10^4
              -5.24·10^2      0.684·10^4     -4.244·10^4
                              0.016·10^4
  8     1     -1.24·10^2      7.564·10^4     -1.320·10^4     6.561·10^3
              1.538·10^4      5.723·10^9     1.743·10^8
              -15.128·10^4    -0.003·10^9    -9.927·10^8
                              0
 16     1     -1.359·10^5     5.720·10^9     -8.184·10^8     4.305·10^7


It is readily seen that in the next transformation the middle
coefficient will be equal to the square of the earlier value, and
so we stop the root-squaring process. Since for the 16th power
there are two negative coefficients among the coefficients of the
transformed equation, equation (4) admits two pairs of complex
roots:
$$x_{1,2} = u_1 \pm iv_1$$
and
$$x_{3,4} = u_2 \pm iv_2$$
which, respectively, satisfy the three-term equations
$$x^{32} + 1.359\cdot10^{5}x^{16} + 5.720\cdot10^{9} = 0$$
and
$$5.720\cdot10^{9}x^{32} + 8.184\cdot10^{8}x^{16} + 4.305\cdot10^{7} = 0$$
From this we determine the squares of the moduli of these roots:
$$r_1^2 = \sqrt[16]{5.720\cdot10^{9}} = 4.072$$
and
$$r_2^2 = \sqrt[16]{\frac{4.305\cdot10^{7}}{5.720\cdot10^{9}}} = 0.7367$$
Since
$$\frac{1}{r_1^2} = 0.2456,\quad \frac{1}{r_2^2} = 1.3574$$
it follows, on the basis of system (3), that to find the real parts
$u_1$ and $u_2$ we have the system
$$u_1 + u_2 = 0,$$
$$0.2456u_1 + 1.3574u_2 = 0.5$$
whence
$$u_1 = -0.4497,$$
$$u_2 = 0.4497$$
Using the squares $r_1^2$ and $r_2^2$ of the moduli of the roots, we
determine the coefficients $v_1$ and $v_2$ of the imaginary parts of the roots:
$$v_1 = \sqrt{r_1^2 - u_1^2} = 1.967,$$
$$v_2 = \sqrt{r_2^2 - u_2^2} = 0.731$$
Thus, the roots of equation (4) are of the form
$$x_{1,2} = -0.450 \pm 1.967i$$
and
$$x_{3,4} = 0.450 \pm 0.731i$$
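The whole of this example, including the linear system (3), is easily checked by machine. The following is a sketch under stated assumptions: the function name `graeffe_step` and the use of Cramer's rule for the 2x2 system are illustrative choices, and exact integers replace the four-digit rounding of Table 10.

```python
def graeffe_step(b):
    """One root-squaring step: coefficients of the equation with roots -x_k^2."""
    n = len(b) - 1
    c = []
    for k in range(n + 1):
        s = b[k] * b[k]
        sign = -1
        j = 1
        while k - j >= 0 and k + j <= n:
            s += sign * 2 * b[k - j] * b[k + j]
            sign = -sign
            j += 1
        c.append(s)
    return c


a = [1, 0, 4, -3, 3]                  # equation (4)
first = graeffe_step(a)               # row "Power 2" of Table 10
b = a
p = 4
for _ in range(p):
    b = graeffe_step(b)               # roots are now -x_k^16
power = 2 ** p

# Squares of the moduli from the three-term equations (2') and (2'').
r1_sq = (b[2] / b[0]) ** (1 / power)
r2_sq = (b[4] / b[2]) ** (1 / power)

# System (3); equation (4) has no real roots, so sigma = sigma' = 0.
rhs1 = -a[1] / (2 * a[0])
rhs2 = -a[3] / (2 * a[4])
det = 1 / r2_sq - 1 / r1_sq           # determinant of the 2x2 system
u1 = (rhs1 / r2_sq - rhs2) / det
u2 = (rhs2 - rhs1 / r1_sq) / det
v1 = (r1_sq - u1 * u1) ** 0.5
v2 = (r2_sq - u2 * u2) ** 0.5

print(round(r1_sq, 4), round(r2_sq, 4))
print(round(u1, 3), round(v1, 3), round(u2, 3), round(v2, 3))
```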


5.13 BERNOULLI’S METHOD 


Suppose we have an algebraic equation
$$a_0x^n + a_1x^{n-1} + \ldots + a_n = 0 \quad (a_0 \ne 0) \qquad (1)$$
whose roots $x_1, x_2, \ldots, x_n$ are distinct.
On the basis of the coefficients $a_k$ ($k = 0, 1, \ldots, n$) we
construct a so-called difference equation
$$a_0y_{n+i} + a_1y_{n+i-1} + \ldots + a_ny_i = 0 \quad (i = 0, 1, 2, \ldots) \qquad (2)$$
which is a recurrence relation relating $n+1$ arbitrary successive
terms of the nonterminating sequence
$$y_0, y_1, y_2, \ldots, y_i, \ldots \qquad (3)$$


The sequence (3) $y_i = f(i)$ ($i = 0, 1, 2, \ldots$) whose terms satisfy
the difference equation (2) is called the solution of the equation.
To construct a solution $y_i$ it is sufficient to specify $n$ initial
values $y_0, y_1, \ldots, y_{n-1}$; the remaining terms $y_n, y_{n+1}, \ldots$ can be
found in a step-by-step manner from equation (2).

Proof is given [8] in the theory of finite differences that if the
roots $x_1, x_2, \ldots, x_n$ of an algebraic equation (1) are distinct, then
any solution of the difference equation (2) is of the form
$$y_i = C_1x_1^i + C_2x_2^i + \ldots + C_nx_n^i \quad (i = 0, 1, 2, \ldots) \qquad (4)$$
where $C_1, C_2, \ldots, C_n$ are arbitrary constants. Thus, equation (1)
is the characteristic equation of (2). The constants $C_1, C_2, \ldots, C_n$
can be found from the initial conditions:
$$C_1 + C_2 + \ldots + C_n = y_0,$$
$$C_1x_1 + C_2x_2 + \ldots + C_nx_n = y_1,$$
$$\ldots \qquad (5)$$
$$C_1x_1^{n-1} + C_2x_2^{n-1} + \ldots + C_nx_n^{n-1} = y_{n-1}$$


Theorem. Let the algebraic equation (1) have a unique
maximum-modulus root $x_1$. Then the ratio of two successive terms $y_{i+1}$ and $y_i$
of the solution of the difference equation (2) tends (generally
speaking) to a limit equal to $x_1$; that is,
$$\lim_{i\to\infty}\frac{y_{i+1}}{y_i} = x_1 \qquad (6)$$

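Proof. Let
$$|x_1| > |x_2| \ge \ldots \ge |x_n| \qquad (7)$$
Assuming the roots $x_k$ ($k = 1, 2, \ldots, n$) to be distinct, we get from
formula (4)
$$y_i = x_1^i\left[C_1 + C_2\left(\frac{x_2}{x_1}\right)^i + \ldots + C_n\left(\frac{x_n}{x_1}\right)^i\right]$$
and
$$y_{i+1} = x_1^{i+1}\left[C_1 + C_2\left(\frac{x_2}{x_1}\right)^{i+1} + \ldots + C_n\left(\frac{x_n}{x_1}\right)^{i+1}\right]$$
whence
$$\frac{y_{i+1}}{y_i} = x_1\,\frac{C_1 + C_2\left(\frac{x_2}{x_1}\right)^{i+1} + \ldots + C_n\left(\frac{x_n}{x_1}\right)^{i+1}}{C_1 + C_2\left(\frac{x_2}{x_1}\right)^{i} + \ldots + C_n\left(\frac{x_n}{x_1}\right)^{i}} \qquad (8)$$

If $C_1 \ne 0$, then, passing to the limit in (8) as $i \to \infty$ and
noting that by virtue of inequality (7) the limit relations
$$\left(\frac{x_2}{x_1}\right)^i \to 0,\ \ldots,\ \left(\frac{x_n}{x_1}\right)^i \to 0$$
hold, we will have
$$\lim_{i\to\infty}\frac{y_{i+1}}{y_i} = x_1$$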

Note 1. If in an inept choice of solution it appears that $C_1 = 0$
and $C_2 \ne 0$, then the limit (6) will be equal to the next largest
(in modulus) root of equation (1).

Note 2. If for the solution $y_i$ the ratio $\frac{y_{i+1}}{y_i}$ oscillates without
tending to a limit, then we may suspect that (1) has complex
roots which are largest in modulus.

Note 3. Making the change of variable
$$x = \frac{1}{z}$$
in (1), it is possible, using Bernoulli's method, to find the
least-modulus nonzero root of equation (1).

Thus, as an approximation to the maximum-modulus root $x_1$
of equation (1), we can use the formula
$$x_1 \approx \frac{y_i}{y_{i-1}}$$
where $i$ is sufficiently large.

In a practical application of the Bernoulli method, one can
specify arbitrary numbers $y_0, y_1, \ldots, y_{n-1}$ and then, using the
formula
$$y_{n+i} = -\frac{1}{a_0}\left(a_ny_i + a_{n-1}y_{i+1} + \ldots + a_1y_{n+i-1}\right) \quad (i = 0, 1, 2, \ldots)$$
compute the sequence of numbers $y_n, y_{n+1}, y_{n+2}, \ldots$ and the
ratios $\frac{y_n}{y_{n-1}}, \frac{y_{n+1}}{y_n}, \frac{y_{n+2}}{y_{n+1}}, \ldots$ If, as $i$ increases, the ratio $\frac{y_{n+i}}{y_{n+i-1}}$
exhibits a tendency to approach some number $\xi$, the latter is
taken as the maximum-modulus root $x_1$ of equation (1); otherwise
it is quite possible that equation (1) has several
maximum-modulus roots, or (less probably) that the coefficient $C_1 = 0$
for the initial sequence of numbers $y_0, y_1, \ldots, y_{n-1}$.
If a crude value $\alpha$ of the largest, in modulus, root $x_1$ is known,
then it is advantageous, in order to accelerate the convergence of
the process, to put
$$y_0 = 1,\ y_1 = \alpha,\ \ldots,\ y_{n-1} = \alpha^{n-1}$$


Note that the Bernoulli method reduces to a repetitive sequence
of operations of the same type and therefore is very suitable
for machine computation.

The initial values $y_i$ ($i = 0, 1, \ldots, n-1$) can, generally speaking,
be taken in arbitrary fashion. One ordinarily takes
$y_0 = y_1 = \ldots = y_{n-2} = 0$; $y_{n-1} = 1$. Hildebrand [9] has suggested
choosing $y_i$ so that all the coefficients $C_k$ in (4) are equal to unity.
In that case, the process $\frac{y_{i+1}}{y_i}$ definitely converges as $i \to \infty$
provided there is a unique maximum-modulus root of equation (1).
The Bernoulli method can also be used to compute the complex
roots of equation (1) [10].


Example. Find the maximum-modulus root $x_1$ of the equation
$$x^5 + 5x^4 - 5 = 0$$
Solution. The appropriate difference equation is of the form
$$y_{i+5} = 5(y_i - y_{i+4}) \quad (i = 0, 1, 2, \ldots) \qquad (9)$$
In arbitrary fashion we take the values
$$y_0 = 0,\ y_1 = 0,\ y_2 = 0,\ y_3 = 0,\ y_4 = 1$$
By formula (9) we compute the values of $y_i$ for $i \ge 5$. These values
are listed in Table 11.


TABLE 11

FINDING THE ROOTS OF AN ALGEBRAIC EQUATION
BY THE BERNOULLI METHOD

 i     y_i            y_i / y_(i-1)
---   ------------   -------------
 5    -5             -5
 6    25             -5
 7    -125           -5
 8    625            -5
 9    -3120          -4.992
10    15,575         -4.992
11    -77,750        -4.992
12    388,125        -4.99196
13    -1,937,500     -4.991948

Terminating with $y_{13}$, we have
$$x_1 \approx \frac{y_{13}}{y_{12}} = -\frac{1,937,500}{388,125} = -4.991948$$
whence, taking the preceding ratio $\frac{y_{12}}{y_{11}}$ into account, we can approximately put
$$x_1 = -4.99195$$
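Since the method reduces to one repeated operation, Table 11 is easily regenerated by machine. The sketch below is illustrative only: the function name `bernoulli_sequence` is an assumption, and exact rational arithmetic is used so that the $y_i$ come out without rounding.

```python
from fractions import Fraction


def bernoulli_sequence(a, y_init, steps):
    """Extend the n initial values y_init by `steps` further terms using
    y_(n+i) = -(a_1*y_(n+i-1) + ... + a_n*y_i) / a_0,
    where a = [a_0, a_1, ..., a_n] are the coefficients of equation (1)."""
    n = len(a) - 1
    y = [Fraction(value) for value in y_init]
    for _ in range(steps):
        i = len(y) - n
        s = sum(a[k] * y[n + i - k] for k in range(1, n + 1))
        y.append(-s / a[0])
    return y


# x^5 + 5x^4 - 5 = 0 with the initial values of the example.
y = bernoulli_sequence([1, 5, 0, 0, 0, -5], [0, 0, 0, 0, 1], steps=9)
ratios = [float(y[i] / y[i - 1]) for i in range(5, len(y))]

for i in range(5, len(y)):
    print(i, y[i], round(ratios[i - 5], 6))
```

The last ratio printed corresponds to $y_{13}/y_{12}$, the value used in the text to approximate $x_1$.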


In conclusion, it is worth noting that new methods of solving
algebraic equations have recently appeared with convenient
computational schemes (Lin's method, the method of N. V. Paluver,
and others) [10].


REFERENCES FOR CHAPTER 5

[1] A. G. Kurosh, Course of Higher Algebra (translated from the Russian,
Mir Publishers, 1972), Chapters 7 and 8.

[2] G. M. Shapiro, Higher Algebra, 1938, Chapters III and VI (in Russian).

[3] D. Grave, Elements of Higher Algebra, 1914, Chapter X (in Russian).

[4] B. A. Fuks and B. V. Shabat, Functions of a Complex Variable, 1949,
Chapter VII (in Russian).

[5] A. N. Krylov, Lectures on Approximate Computations, 1933, Chapter II
(in Russian).

[6] J. B. Scarborough, Numerical Mathematical Analysis, 1955, Chapter X.

[7] B. K. Mlodzeevsky, Solution of Numerical Equations, 1924, Chapter IV
(in Russian).

[8] A. O. Gelfond, Calculus of Finite Differences, 1952, Chapter V (in Russian).

[9] F. B. Hildebrand, Introduction to Numerical Analysis, 1956.

[10] V. L. Zaguskin, Handbook on Numerical Methods of Solving Algebraic and
Transcendental Equations, 1960 (in Russian).


Chapter 6 
ACCELERATING THE CONVERGENCE OF SERIES 


6.1 ACCELERATING THE CONVERGENCE 
OF NUMERICAL SERIES 
We say that the series
$$a_1 + a_2 + \ldots + a_n + \ldots \qquad (1)$$


converges slowly if we have to take a large number of terms of the 
series in order to obtain the sum to the required degree of accu- 
racy. For instance, suppose we have to find the sum of the series 


S=} Pain -t ae (2) 
to within $10^{-6}$. For the $n$th remainder of the series we have
$$R_n \approx \frac{1}{n}$$

Thus, our accuracy will be assured if we take the sum of 1,000,000 
terms of the series, but this is impossible in any practical sense. 
Therefore, in solving this problem we regard the series (2) as a 
slowly convergent series. 

To find the sum of a slowly convergent series directly to a
specified accuracy $\varepsilon$ is, generally speaking, an arduous task or
practically impossible. Of importance, therefore, are transformations
of the series which accelerate the convergence. We shall examine
here the Kummer transformation [3], [4], which will be found to
be useful in a number of cases.

Let the series (1) converge and let the sum be $A$. We choose
an auxiliary convergent series
$$b_1 + b_2 + \ldots + b_n + \ldots \quad (b_n \ne 0) \qquad (2')$$
the sum of which, $B$, is known, a series such that
$$\lim_{n\to\infty}\frac{a_n}{b_n} = q \ne 0 \qquad (3)$$

Then we have the obvious equation
$$\sum_{n=1}^{\infty}a_n = q\sum_{n=1}^{\infty}b_n + \sum_{n=1}^{\infty}(a_n - qb_n)$$
or
$$A = qB + \sum_{n=1}^{\infty}(a_n - qb_n) \qquad (4)$$
In particular, if $a_n \sim b_n$, then $q = 1$ and we have
$$A = B + \sum_{n=1}^{\infty}(a_n - b_n) \qquad (4')$$


Thus, finding the sum of the series (1) is replaced, in the general
case, by finding the sum of the series
$$\sum_{n=1}^{\infty}(a_n - qb_n) \qquad (5)$$
The remainder of the series (5), $R_N$, may be written as
$$R_N = \sum_{n=N+1}^{\infty}(a_n - qb_n) = \sum_{n=N+1}^{\infty}\left(1 - q\frac{b_n}{a_n}\right)a_n = \sum_{n=N+1}^{\infty}\varepsilon_na_n$$
where $\varepsilon_n = 1 - q\frac{b_n}{a_n} \to 0$ as $n \to \infty$.

For this reason, in the general case, the series (5) converges
faster than the original series (1). The main difficulty in applying
the Kummer transformation consists in choosing a suitable auxiliary
series (2').

We demonstrate the application of this transformation for the
positive series (1) whose terms $a_n$ are rational functions of an
integral variable $n$; that is,
$$a_n = \frac{\alpha_0n^p + \alpha_1n^{p-1} + \ldots + \alpha_p}{\beta_0n^q + \beta_1n^{q-1} + \ldots + \beta_q} \quad (n = 1, 2, \ldots) \qquad (6)$$
where $p$ and $q$ are nonnegative integers and $\alpha_0 > 0$, $\beta_0 > 0$. For
convergence of a series with general term (6), it is necessary and
sufficient that the inequality $q \ge p + 2$ hold true.

In this case
$$a_n = O\left(\frac{1}{n^2}\right)\ ^{1)}$$
(at least!).

¹⁾ We say that $\alpha_n$ is an infinitesimal of order not less than $m$ with respect
to $\frac{1}{n}$:


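Consider the auxiliary series
$$S^{(m)} = \sum_{n=1}^{\infty}\frac{1}{n(n+1)\ldots(n+m)} \quad (m = 1, 2, \ldots) \qquad (7)$$
Since
$$\frac{1}{n(n+1)\ldots(n+m)} = \frac{1}{m}\left[\frac{1}{n(n+1)\ldots(n+m-1)} - \frac{1}{(n+1)(n+2)\ldots(n+m)}\right]$$
it follows that
$$S_N^{(m)} = \sum_{n=1}^{N}\frac{1}{n(n+1)\ldots(n+m)} = \frac{1}{m}\left[\frac{1}{m!} - \frac{1}{(N+1)(N+2)\ldots(N+m)}\right]$$
Hence
$$S^{(m)} = \lim_{N\to\infty}S_N^{(m)} = \frac{1}{m\cdot m!} \qquad (8)$$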

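Utilizing Stirling's idea, represent the general term of the series
as defined by formula (6) in the form of a finite sum of inverse
factorials
$$a_n = \frac{A_1}{n(n+1)} + \frac{A_2}{n(n+1)(n+2)} + \ldots + \frac{A_m}{n(n+1)\ldots(n+m)} + \alpha_n^{(m)}$$
where $A_1, A_2, \ldots, A_m$ are undetermined coefficients and $\alpha_n^{(m)}$ is the
remainder term. Select the coefficients $A_i$ ($i = 1, 2, \ldots, m$) so that
$$\alpha_n^{(m)} = O\left(\frac{1}{n^{m+2}}\right)$$

(footnote ¹⁾ continued)
$$\alpha_n = O\left(\frac{1}{n^m}\right)$$
if the limit
$$\lim_{n\to\infty}\frac{\alpha_n}{1/n^m} = c$$
is finite. If $c \ne 0$, then $\alpha_n$ is an infinitesimal of order exactly $m$ with respect to $\frac{1}{n}$.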


For this purpose it suffices to determine the coefficients $A_i$
successively from the formulas
$$A_1 = \lim_{n\to\infty}n(n+1)a_n,$$
$$\ldots \qquad (9)$$
$$A_m = \lim_{n\to\infty}\left[a_n - \sum_{i=1}^{m-1}\frac{A_i}{n(n+1)\ldots(n+i)}\right]n(n+1)\ldots(n+m)$$


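In accordance with the general scheme, we take for the auxiliary
series (2')
$$B = \sum_{n=1}^{\infty}b_n = \sum_{n=1}^{\infty}\left[\frac{A_1}{n(n+1)} + \frac{A_2}{n(n+1)(n+2)} + \ldots + \frac{A_m}{n(n+1)\ldots(n+m)}\right] =$$
$$= A_1S^{(1)} + A_2S^{(2)} + \ldots + A_mS^{(m)} = \frac{A_1}{1\cdot1!} + \frac{A_2}{2\cdot2!} + \ldots + \frac{A_m}{m\cdot m!} \qquad (10)$$
It is obvious that
$$\lim_{n\to\infty}\frac{a_n}{b_n} = 1$$
and
$$S = \sum_{n=1}^{\infty}a_n = B + \sum_{n=1}^{\infty}\alpha_n^{(m)} \qquad (11)$$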


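Since the rapid convergence of the supplementary series $\sum_{n=1}^{\infty}\alpha_n^{(m)}$
is, generally speaking, revealed only for a sufficiently large $n$, it
is convenient in a practical sense to perform the indicated
transformation beginning with some term $a_{p+1}$ of the series. Assuming
$$S = \sum_{n=1}^{p}a_n + \sum_{n=p+1}^{\infty}a_n = S_p + \sum_{n=p+1}^{\infty}a_n$$
we have
$$\sum_{n=p+1}^{\infty}a_n = \sum_{n=p+1}^{\infty}\left[\frac{A_1}{n(n+1)} + \ldots + \frac{A_m}{n(n+1)\ldots(n+m)}\right] + \sum_{n=p+1}^{\infty}\alpha_n^{(m)} =$$
$$= A_1\sum_{n=p+1}^{\infty}\left(\frac{1}{n} - \frac{1}{n+1}\right) + \frac{A_2}{2}\sum_{n=p+1}^{\infty}\left[\frac{1}{n(n+1)} - \frac{1}{(n+1)(n+2)}\right] + \ldots$$
$$\ldots + \frac{A_m}{m}\sum_{n=p+1}^{\infty}\left[\frac{1}{n(n+1)\ldots(n+m-1)} - \frac{1}{(n+1)(n+2)\ldots(n+m)}\right] + \sum_{n=p+1}^{\infty}\alpha_n^{(m)} =$$
$$= \frac{A_1}{p+1} + \frac{A_2}{2}\cdot\frac{1}{(p+1)(p+2)} + \ldots + \frac{A_m}{m}\cdot\frac{1}{(p+1)(p+2)\ldots(p+m)} + \sum_{n=p+1}^{\infty}\alpha_n^{(m)}$$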

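In particular, as $m \to \infty$, we obtain Stirling's expansion (taking
into account that $\alpha_n^{(\infty)} = 0$):
$$S = \sum_{n=1}^{p}a_n + A_1\cdot\frac{1}{p+1} + \frac{A_2}{2}\cdot\frac{1}{(p+1)(p+2)} + \ldots + \frac{A_m}{m}\cdot\frac{1}{(p+1)(p+2)\ldots(p+m)} + \ldots$$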
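Example. Find the sum of the series
$$S = \sum_{n=1}^{\infty}\frac{1}{n^2+1} \qquad (12)$$
to within 0.001.
Solution. Setting
$$a_n = \frac{1}{n^2+1} = \frac{A_1}{n(n+1)} + \frac{A_2}{n(n+1)(n+2)} + \alpha_n^{(2)}$$
we have
$$A_1 = \lim_{n\to\infty}\frac{n(n+1)}{n^2+1} = 1,$$
$$A_2 = \lim_{n\to\infty}\left[\frac{1}{n^2+1} - \frac{1}{n(n+1)}\right]n(n+1)(n+2) = \lim_{n\to\infty}\frac{(n-1)(n+2)}{n^2+1} = 1$$
Hence
$$\alpha_n^{(2)} = \frac{1}{n^2+1} - \frac{1}{n(n+1)} - \frac{1}{n(n+1)(n+2)} =$$
$$= \frac{n^3 + 3n^2 + 2n - n^3 - 2n^2 - n - 2 - n^2 - 1}{n(n+1)(n+2)(n^2+1)} = \frac{n-3}{n(n+1)(n+2)(n^2+1)}$$
On the basis of formulas (10) and (11) we get
$$S = \frac{1}{1\cdot1!} + \frac{1}{2\cdot2!} + \sum_{n=1}^{\infty}\frac{n-3}{n(n+1)(n+2)(n^2+1)} \qquad (12')$$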


Since for $n \ge 3$ we have
$$0 \le \frac{n-3}{n(n+1)(n+2)(n^2+1)} < \frac{1}{n^4}$$
it follows that
$$R_N = \sum_{n=N+1}^{\infty}\frac{n-3}{n(n+1)(n+2)(n^2+1)} < \int_N^{\infty}\frac{dx}{x^4} = \frac{1}{3N^3}$$
which for $N = 10$ equals $\frac{1}{3}\cdot10^{-3} < 0.001$.

And from this it follows that the number of terms in the sum
(12') may be taken $N = 10$; these summands must be computed to
four decimal places in the narrow sense. We thus have

S = 1.25 + (-0.1667) + (-0.0083) + 0 + 0.0005 + 0.0004 +
+ 0.0002 + 0.0002 + 0.0001 + 0.0001 + 0.0001 = 1.0766

Noting that the sum of the first four summands is exact, for the
absolute error of the result we get the estimate
$$\Delta < \frac{1}{3}\cdot10^{-3} + 7\cdot\frac{1}{2}\cdot10^{-4} < 0.7\cdot10^{-3}$$
Rounding, we find
$$S = 1.077$$
with the limiting absolute error
$$\Delta = 0.7\cdot10^{-3} + 0.4\cdot10^{-3} = 1.1\cdot10^{-3}$$

Note that for the remainder of the given series (12) we have
the estimate
$$R_N < \int_N^{\infty}\frac{dx}{x^2} = \frac{1}{N} < \frac{1}{2}\cdot10^{-3}$$
whence $N > 2000$, which means that without the transformation
we would need about 2000 terms of the series to attain the same
accuracy.
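The gain from the transformation is easy to see numerically. The sketch below is illustrative and rests on one fact not taken from the text: the closed form $\sum_{n=1}^{\infty}\frac{1}{n^2+1} = \frac{1}{2}(\pi\coth\pi - 1)$, a standard result used here only as a reference value. The function names are assumptions.

```python
import math


def direct_partial(N):
    """Partial sum of the original series (12)."""
    return sum(1.0 / (n * n + 1) for n in range(1, N + 1))


def transformed_partial(N):
    """Formula (12'): 1/(1*1!) + 1/(2*2!) plus N terms of the remainder series."""
    tail = sum((n - 3) / (n * (n + 1) * (n + 2) * (n * n + 1))
               for n in range(1, N + 1))
    return 1.25 + tail


# Reference value (assumption: standard closed form, not given in the text).
exact = (math.pi / math.tanh(math.pi) - 1) / 2

print(abs(transformed_partial(10) - exact))   # error of (12') with 10 terms
print(abs(direct_partial(10) - exact))        # error of (12) with 10 terms
print(abs(direct_partial(2000) - exact))      # error of (12) with 2000 terms
```

Ten terms of the transformed series already fall within the remainder bound $\frac{1}{3}\cdot10^{-3}$, whereas the original series needs on the order of two thousand terms for comparable accuracy.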


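Note. We could also use the following series for an approximate
computation of the sum of the series (1) with general term (6):
$$\sum_{n=1}^{\infty}\frac{1}{n^2} = \frac{\pi^2}{6},\quad \sum_{n=1}^{\infty}\frac{1}{n^4} = \frac{\pi^4}{90},\quad \sum_{n=1}^{\infty}\frac{1}{n^6} = \frac{\pi^6}{945},\ \text{etc.}$$
Generally speaking,
$$\sum_{n=1}^{\infty}\frac{1}{n^{2p}} = \frac{1}{2}\,|B_{2p}|\,\frac{(2\pi)^{2p}}{(2p)!}$$
where $B_n$ ($n = 1, 2, \ldots$) are Bernoulli numbers [5], [6] defined by
the symbolic formula
$$(B+1)^n - B^n = 0$$
in which, after expanding by the binomial theorem, we put $B^k = B_k$.
In particular, we have
$$B_0 = 1,\quad B_1 = -\frac{1}{2},\quad B_2 = \frac{1}{6},\quad B_4 = -\frac{1}{30},\quad B_6 = \frac{1}{42}$$
(see Sec. 16.11).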
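The symbolic formula gives a recurrence from which the Bernoulli numbers are computed one after another: expanding $(B+1)^n - B^n$ for $n \ge 2$ and putting $B^k = B_k$ yields $\sum_{k=0}^{n-1}\binom{n}{k}B_k = 0$. The following sketch uses that recurrence; the function names are assumptions, and the starting value $B_0 = 1$ is taken as the usual convention.

```python
from fractions import Fraction
from math import comb, factorial, pi


def bernoulli_numbers(count):
    """Return [B_0, ..., B_(count-1)] from sum_{k<n} C(n, k) * B_k = 0, n >= 2."""
    B = [Fraction(1)]
    for n in range(2, count + 1):
        s = sum(comb(n, k) * B[k] for k in range(n - 1))
        B.append(-s / n)          # solves the relation for B_(n-1)
    return B


def zeta_even(p, B):
    """Sum of 1/n^(2p) by the general formula of the Note."""
    return float(abs(B[2 * p])) * (2 * pi) ** (2 * p) / (2 * factorial(2 * p))


B = bernoulli_numbers(9)
print(B[1], B[2], B[4], B[6])
print(zeta_even(1, B), pi ** 2 / 6)
print(zeta_even(2, B), sum(1.0 / n ** 4 for n in range(1, 1001)))
```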


6.2 ACCELERATING THE CONVERGENCE OF POWER SERIES 
BY THE EULER-ABEL METHOD 


Consider the convergent power series
$$f(x) = \sum_{n=0}^{\infty}a_nx^n \qquad (1)$$
where $f(x)$ is the sum of the series.
Let the radius of convergence $R$ of series (1) be finite and
nonzero. Without loss of generality, we can take it that $R = 1$.¹⁾
Write the series (1) as
$$f(x) = a_0 + x\varphi(x) \qquad (2)$$
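where
$$\varphi(x) = \sum_{n=1}^{\infty}a_nx^{n-1} = \sum_{n=0}^{\infty}a_{n+1}x^n \qquad (3)$$
Multiplying both sides of (3) by the binomial $1 - x$, we get
$$(1-x)\varphi(x) = \sum_{n=0}^{\infty}a_{n+1}x^n - \sum_{n=0}^{\infty}a_{n+1}x^{n+1} \qquad (4)$$
Assuming $n + 1 = m$ in the second sum and noting that the sum
does not depend on the summation index, we have
$$\sum_{n=0}^{\infty}a_{n+1}x^{n+1} = \sum_{m=1}^{\infty}a_mx^m = \sum_{n=1}^{\infty}a_nx^n$$
Therefore
$$(1-x)\varphi(x) = \sum_{n=0}^{\infty}a_{n+1}x^n - \sum_{n=1}^{\infty}a_nx^n = a_0 + \sum_{n=0}^{\infty}(a_{n+1} - a_n)x^n = a_0 + \sum_{n=0}^{\infty}\Delta a_nx^n$$
where
$$\Delta a_n = a_{n+1} - a_n \quad (n = 0, 1, 2, \ldots)$$
are finite differences of the first order of the coefficients $a_n$ (for
more on finite differences see Sec. 14.1). Thus from formulas (3)
and (4) we derive
$$\varphi(x) = \sum_{n=0}^{\infty}a_{n+1}x^n = \frac{a_0}{1-x} + \frac{1}{1-x}\sum_{n=0}^{\infty}\Delta a_nx^n$$

¹⁾ Indeed, if $0 < R < \infty$ and $R \ne 1$, then, assuming $\xi = \frac{x}{R}$, we get a power
series in the variable $\xi$ with radius of convergence $\rho = 1$.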

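and, hence,
$$f(x) = a_0 + \frac{a_0x}{1-x} + \frac{x}{1-x}\sum_{n=0}^{\infty}\Delta a_nx^n$$
that is
$$\sum_{n=0}^{\infty}a_nx^n = \frac{a_0}{1-x} + \frac{x}{1-x}\sum_{n=0}^{\infty}\Delta a_nx^n \qquad (5)$$

This transformation of a power series is called the Euler-Abel
transformation. Analogously, applying the Euler-Abel
transformation to the power series $\sum_{n=0}^{\infty}\Delta a_nx^n$, we find
$$\sum_{n=0}^{\infty}\Delta a_nx^n = \frac{\Delta a_0}{1-x} + \frac{x}{1-x}\sum_{n=0}^{\infty}\Delta^2a_nx^n$$
where
$$\Delta^2a_n = \Delta(\Delta a_n) = \Delta a_{n+1} - \Delta a_n$$
are finite differences of the second order of the coefficients $a_n$, whence,
on the basis of formula (5), we get
$$\sum_{n=0}^{\infty}a_nx^n = \frac{a_0}{1-x} + \frac{x}{1-x}\left[\frac{\Delta a_0}{1-x} + \frac{x}{1-x}\sum_{n=0}^{\infty}\Delta^2a_nx^n\right] =$$
$$= \frac{a_0}{1-x} + \frac{x\Delta a_0}{(1-x)^2} + \left(\frac{x}{1-x}\right)^2\sum_{n=0}^{\infty}\Delta^2a_nx^n$$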

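Repeating the Euler-Abel transformation $p$ times in succession, we
obtain
$$\sum_{n=0}^{\infty}a_nx^n = \frac{a_0}{1-x} + \frac{x\Delta a_0}{(1-x)^2} + \ldots + \frac{x^{p-1}\Delta^{p-1}a_0}{(1-x)^p} + \left(\frac{x}{1-x}\right)^p\sum_{n=0}^{\infty}\Delta^pa_nx^n$$
where
$$\Delta^pa_n = \Delta^{p-1}a_{n+1} - \Delta^{p-1}a_n \quad (n = 0, 1, 2, \ldots)$$
are finite differences of the $p$th order of the coefficients $a_n$, and
$\Delta^ka_0$ ($k = 0, 1, 2, \ldots$) are the successive finite differences of the
coefficients $a_n$ for $n = 0$. Thus
$$\sum_{n=0}^{\infty}a_nx^n = \sum_{k=0}^{p-1}\Delta^ka_0\,\frac{x^k}{(1-x)^{k+1}} + \left(\frac{x}{1-x}\right)^p\sum_{n=0}^{\infty}\Delta^pa_nx^n \qquad (6)$$
where $\Delta^0a_0 = a_0$. Formula (6) is advantageously used when the finite
differences $\Delta^pa_n$ are of higher order of decay, as $n \to \infty$, than the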


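coefficients $a_n$. This occurs frequently. For example, if $a_n = \frac{1}{n}$, then
we get
$$\Delta a_n = \frac{1}{n+1} - \frac{1}{n} = -\frac{1}{n(n+1)}$$
That is, $\Delta a_n$ decreases faster than $a_n$ as $n \to \infty$.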
In particular, if $a_n = P(n)$, where $P(n)$ is an integral
polynomial of degree $p - 1$, then formula (6) yields the sum of the series
$$\sum_{n=0}^{\infty}P(n)x^n = \sum_{k=0}^{p-1}\Delta^kP(0)\,\frac{x^k}{(1-x)^{k+1}} \quad (|x| < 1) \qquad (7)$$
in closed form, since $\Delta^pP(n) = 0$.

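Formula (6) becomes meaningless when $x = 1$. The Euler-Abel
transformation may be modified to accommodate this case. Setting
$x = -t$, we have
$$f(x) = \sum_{n=0}^{\infty}a_nx^n = \sum_{n=0}^{\infty}(-1)^na_nt^n =$$
$$= \sum_{k=0}^{p-1}\Delta^k[(-1)^na_n]_{n=0}\,\frac{t^k}{(1-t)^{k+1}} + \left(\frac{t}{1-t}\right)^p\sum_{n=0}^{\infty}\Delta^p[(-1)^na_n]\,t^n$$
Returning to the earlier variable, we obtain
$$\sum_{n=0}^{\infty}a_nx^n = \sum_{k=0}^{p-1}(-1)^k\Delta^k[(-1)^na_n]_{n=0}\,\frac{x^k}{(1+x)^{k+1}} + \left(\frac{x}{1+x}\right)^p\sum_{n=0}^{\infty}(-1)^{n+p}\Delta^p[(-1)^na_n]\,x^n \qquad (8)$$
Formula (8) is meaningful for $x = 1$ as well.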


Example 1. Find the sum of the series
$$f(x) = \sum_{n=0}^{\infty}\frac{x^n}{(n+1)(n+2)} \qquad (9)$$
to within 0.001 when $x = -1$.


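Solution. Apply the Euler transformation twice ($p = 2$) to get
$$a_n = \frac{1}{(n+1)(n+2)},$$
$$\Delta a_n = a_{n+1} - a_n = \frac{1}{(n+2)(n+3)} - \frac{1}{(n+1)(n+2)} = -\frac{2}{(n+1)(n+2)(n+3)},$$
$$\Delta^2a_n = \Delta a_{n+1} - \Delta a_n = -\frac{2}{(n+2)(n+3)(n+4)} + \frac{2}{(n+1)(n+2)(n+3)} = \frac{6}{(n+1)(n+2)(n+3)(n+4)}$$
Hence
$$a_0 = \frac{1}{2},\quad \Delta a_0 = -\frac{1}{3}$$
From this, on the basis of formula (6), we get
$$f(-1) = \frac{1}{4} + \frac{1}{12} + \frac{1}{4}\sum_{n=0}^{\infty}\frac{6(-1)^n}{(n+1)(n+2)(n+3)(n+4)} =$$
$$= \frac{1}{4} + \frac{1}{12} + \frac{3}{2}\cdot\frac{1}{24} - \frac{3}{2}\cdot\frac{1}{120} + \frac{3}{2}\cdot\frac{1}{360} - \frac{3}{2}\cdot\frac{1}{840} + \frac{3}{2}\cdot\frac{1}{1680} - \frac{3}{2}\cdot\frac{1}{3024} + \ldots \qquad (10)$$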

The series (10) is an alternating series with terms decreasing
monotonically in modulus. Therefore, if we stop with the term
$$\frac{3}{2}\cdot\frac{1}{3024} = \frac{1}{2016}$$
then the remainder $R$ of the series will not exceed (in modulus)
the first discarded term:
$$|R| \le \frac{3}{2}\cdot\frac{1}{5040} = \frac{1}{3360} < 3\cdot10^{-4}$$
Thus, taking two extra digits, we have

f(-1) = 0.25000 + 0.08333 + 0.06250 - 0.01250 +
+ 0.00417 - 0.00179 + 0.00089 - 0.00050 = 0.38610

with absolute error
$$\Delta < 5\cdot10^{-5} + 3\cdot10^{-4} < 4\cdot10^{-4}$$
Rounding this number to three decimals, we get the approximate
value $f(-1) = 0.386$ with limiting absolute error
$$\Delta < 4\cdot10^{-4} + 1\cdot10^{-4} = 5\cdot10^{-4}$$
The exact value of the sum is
$$f(-1) = 2\ln2 - 1 = 0.38630\ldots$$
It will be noted that if one computed the number $f(-1)$
directly, using the series (9), roughly forty-five terms of the series
would be needed to attain the required accuracy.
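Formula (6) is straightforward to program for an arbitrary coefficient sequence. The sketch below is an illustration only: the function names `forward_differences` and `euler_abel`, and the truncation of the remaining series after `terms` summands, are assumptions made for the example.

```python
import math


def forward_differences(seq, order):
    """Return [seq, Δseq, Δ²seq, ...] up to the given order."""
    table = [list(seq)]
    for _ in range(order):
        prev = table[-1]
        table.append([prev[i + 1] - prev[i] for i in range(len(prev) - 1)])
    return table


def euler_abel(a, x, p, terms):
    """Formula (6) with the transformed series cut off after `terms` summands.
    `a` is a function returning the coefficient a_n."""
    seq = [a(n) for n in range(terms + p)]
    table = forward_differences(seq, p)
    head = sum(table[k][0] * x ** k / (1 - x) ** (k + 1) for k in range(p))
    tail = sum(table[p][n] * x ** n for n in range(terms))
    return head + (x / (1 - x)) ** p * tail


def coefficient(n):
    """Coefficients of series (9)."""
    return 1.0 / ((n + 1) * (n + 2))


exact = 2 * math.log(2) - 1
value = euler_abel(coefficient, -1.0, p=2, terms=6)        # as in (10)
direct = sum(coefficient(n) * (-1) ** n for n in range(8)) # 8 terms of (9)

print(value, abs(value - exact))
print(direct, abs(direct - exact))
```

With `p=2` and six terms of the transformed series, the same eight summands as in the text are used; the direct partial sum of (9) with eight terms is noticeably further from $2\ln2 - 1$.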


Example 2. Find the sum of the series
$$S(x) = \sum_{n=0}^{\infty}(n^2 + n + 1)x^n$$
Solution. We have
$$P(n) = n^2 + n + 1$$
Construct Table 12.

TABLE 12
TABLE OF FINITE DIFFERENCES

 n    P(n)    ΔP(n)    Δ²P(n)
---   ----    -----    ------
 0    1       2        2
 1    3       4
 2    7

Formula (7) yields
$$S(x) = \frac{1}{1-x} + \frac{2x}{(1-x)^2} + \frac{2x^2}{(1-x)^3}$$
for $|x| < 1$.
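The closed form can be compared with partial sums of the series at any point inside the interval of convergence. A minimal check follows; the function names and the sample points $x = 0.5$ and $x = -0.3$ are chosen here purely for illustration.

```python
def closed_form(x):
    """S(x) from formula (7) with P(0) = 1, ΔP(0) = 2, Δ²P(0) = 2."""
    return 1 / (1 - x) + 2 * x / (1 - x) ** 2 + 2 * x ** 2 / (1 - x) ** 3


def partial_sum(x, N):
    """First N terms of the series with general term (n^2 + n + 1) x^n."""
    return sum((n * n + n + 1) * x ** n for n in range(N))


for x in (0.5, -0.3):
    print(x, closed_form(x), partial_sum(x, 200))
```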


6.3 ESTIMATES OF FOURIER COEFFICIENTS 


The Fourier trigonometric series of a given function
$f(x)$ ($-\pi \le x \le \pi$)¹⁾ is the series
$$\frac{a_0}{2} + \sum_{n=1}^{\infty}(a_n\cos nx + b_n\sin nx) \qquad (1)$$

¹⁾ For the sake of simplicity of formulation we consider the function defined
on the interval $[-\pi, \pi]$. The general case of the function $\varphi(t)$ defined on the
interval $[a, b]$ may be reduced to ours by means of the linear substitution
$$t = \frac{b-a}{2\pi}x + \frac{b+a}{2}$$

the coefficients $a_n$, $b_n$ of which [Fourier coefficients of the function
$f(x)$] are computed from the formulas
$$a_n = \frac{1}{\pi}\int_{-\pi}^{\pi}f(x)\cos nx\,dx \quad (n = 0, 1, \ldots), \qquad (2)$$
$$b_n = \frac{1}{\pi}\int_{-\pi}^{\pi}f(x)\sin nx\,dx \quad (n = 1, 2, \ldots) \qquad (2')$$

A sufficient condition for the existence of a Fourier series of
a function $f(x)$ is the integrability of the function on the interval
$[-\pi, \pi]$. In this case, the Fourier coefficients (2) and (2') have
definite finite values.

It may happen that the resulting Fourier series diverges or
converges to a different function. We give without proof [1], [7] the
conditions under which a Fourier trigonometric series converges
to a function $f(x)$ at all points of continuity of the function.


Convergence theorem. If a function $f(x)$ is piecewise continuous and piecewise differentiable on the interval $[-\pi, \pi]$, then its Fourier series converges on the whole number axis and its sum $S(x)$ is a periodic function, with period $2\pi$, equal to

$$S(x_0) = \frac{f(x_0 - 0) + f(x_0 + 0)}{2} \tag{3}$$

at any point $x_0 \in (-\pi, \pi)$ and $S(\pm\pi) = 2^{-1}[f(-\pi + 0) + f(\pi - 0)]$.

In particular, $S(x_0) = f(x_0)$ if the function is continuous at the point $x = x_0$, that is, if $f(x_0 - 0) = f(x_0 + 0) = f(x_0)$.

If, besides, the function $f(x)$ is periodic with period $2\pi$, then its Fourier series converges for every value $x_0$ and has the sum (3).

If the conditions of the convergence theorem are fulfilled, then it is obvious that $a_n \to 0$ and $b_n \to 0$ as $n \to \infty$. We give more exact estimates of the Fourier coefficients by imposing certain restrictions on the behaviour of the function $f(x)$.


Definition. We say that a function $f(x)$ specified on the interval $[-\pi, \pi]$ belongs to periodicity class $C^{(m)}$ if:

(1) $f(x)$ is continuous on $[-\pi, \pi]$ together with its derivatives up to order $m$ inclusive;

(2) $f^{(k)}(-\pi + 0) = f^{(k)}(\pi - 0)$ for $k = 0, 1, 2, \ldots, m$, that is, the values of the function $f(x)$ and of each of its first $m$ derivatives must coincide at the two endpoints of $[-\pi, \pi]$.

From Conditions (1) and (2) it follows that a periodic continuation of $f(x)$ belongs to the class $C^{(m)}(-\infty, +\infty)$.


Lemma. If a function $f(x)$ belongs to periodicity class $C^{(m)}$ on the interval $[-\pi, \pi]$ (more briefly, $f(x) \in C^{(m)}[-\pi, \pi]$), then its Fourier coefficients $a_n$ and $b_n$ are infinitesimals, as $n \to \infty$, of order higher than $m$ with respect to $\frac{1}{n}$, that is,

$$a_n = o\left(\frac{1}{n^m}\right), \quad b_n = o\left(\frac{1}{n^m}\right) \; {}^{1)}$$


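Proof. Integrate the right members of the following equations by parts $m$ times:

$$a_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \cos nx \, dx \quad (n = 0, 1, \ldots), \tag{4}$$

$$b_n = \frac{1}{\pi} \int_{-\pi}^{\pi} f(x) \sin nx \, dx \quad (n = 1, 2, \ldots) \tag{4'}$$

Putting $u = f(x)$ and $dv = \cos nx \, dx$, we find $du = f'(x)\, dx$ and $v = \frac{1}{n} \sin nx$. Thus, by the formula for integration by parts, we have

$$a_n = \frac{1}{\pi} \left[ \frac{1}{n} f(x) \sin nx \right]_{-\pi}^{\pi} - \frac{1}{\pi n} \int_{-\pi}^{\pi} f'(x) \sin nx \, dx = \frac{1}{\pi n} \int_{-\pi}^{\pi} f'(x) \cos\left( \frac{\pi}{2} + nx \right) dx$$

Applying integration by parts once again and noting that $f'(-\pi) = f'(\pi)$, we get

$$a_n = \frac{1}{\pi n^2} \left[ f'(x) \sin\left( \frac{\pi}{2} + nx \right) \right]_{-\pi}^{\pi} - \frac{1}{\pi n^2} \int_{-\pi}^{\pi} f''(x) \sin\left( \frac{\pi}{2} + nx \right) dx = \frac{1}{\pi n^2} \int_{-\pi}^{\pi} f''(x) \cos\left( \frac{\pi}{2} \cdot 2 + nx \right) dx$$

and so forth.

After an $m$-fold integration by parts we have, in formulas (4) and (4'),

$$a_n = \frac{1}{\pi n^m} \int_{-\pi}^{\pi} f^{(m)}(x) \cos\left( \frac{\pi}{2}\, m + nx \right) dx$$

1) The notation $a_n = o\left(\frac{1}{n^m}\right)$ means that $\lim_{n \to \infty} \dfrac{a_n}{1/n^m} = 0$.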


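Similarly

$$b_n = \frac{1}{\pi n^m} \int_{-\pi}^{\pi} f^{(m)}(x) \sin\left( \frac{\pi}{2}\, m + nx \right) dx$$

The integrals

$$\varepsilon_n' = \frac{1}{\pi} \int_{-\pi}^{\pi} f^{(m)}(x) \cos\left( \frac{\pi}{2}\, m + nx \right) dx$$

and

$$\varepsilon_n'' = \frac{1}{\pi} \int_{-\pi}^{\pi} f^{(m)}(x) \sin\left( \frac{\pi}{2}\, m + nx \right) dx$$

are, to within sign, the Fourier coefficients of the continuous (by hypothesis) function $f^{(m)}(x)$. As is well known, the Fourier coefficients of a continuous function tend to zero as their numbers increase without bound 1), irrespective of whether its Fourier series converges or not. Therefore

$$\varepsilon_n' \to 0 \quad \text{and} \quad \varepsilon_n'' \to 0 \quad \text{as } n \to \infty$$

But since

$$a_n = \frac{\varepsilon_n'}{n^m} \quad \text{and} \quad b_n = \frac{\varepsilon_n''}{n^m}$$

it follows that the Fourier coefficients $a_n$ and $b_n$ of the function $f(x)$ are infinitesimals of higher order than $\frac{1}{n^m}$:

$$a_n = o\left(\frac{1}{n^m}\right), \quad b_n = o\left(\frac{1}{n^m}\right)$$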


A. N. Krylov used this result as the basis for a method of accelerating the convergence of the Fourier series.

Note. If $f^{(m)}(x)$ satisfies the conditions of the convergence theorem, then it is easy to prove that

$$\varepsilon_n' = O\left(\frac{1}{n}\right) \quad \text{and} \quad \varepsilon_n'' = O\left(\frac{1}{n}\right)$$

In this case, a better estimate is obtained for the Fourier coefficients of the function $f(x)$:

$$a_n = O\left(\frac{1}{n^{m+1}}\right) \quad \text{and} \quad b_n = O\left(\frac{1}{n^{m+1}}\right)$$

1) This follows from the fact that for any piecewise continuous function $f(x)$ with Fourier coefficients $a_n$ and $b_n$ $(n = 0, 1, 2, \ldots)$ the Bessel inequality [7]

$$\frac{a_0^2}{2} + \sum_{n=1}^{\infty} (a_n^2 + b_n^2) \le \frac{1}{\pi} \int_{-\pi}^{\pi} f^2(x)\, dx$$

is valid. Hence the series $\sum_{n=1}^{\infty} (a_n^2 + b_n^2)$ converges and $a_n \to 0$, $b_n \to 0$ as $n \to \infty$.


6.4 ACCELERATING THE CONVERGENCE OF FOURIER 
TRIGONOMETRIC SERIES BY THE METHOD OF A. N. KRYLOV 


Suppose a function $f(x)$ is piecewise continuous and has piecewise continuous derivatives $f^{(i)}(x)$ $(i = 1, 2, \ldots, m)$ up to the $m$th order inclusive on the interval $[-\pi, \pi]$. Then by virtue of the convergence theorem of Sec. 6.3 the function $f(x)$ can be represented as a Fourier trigonometric series at all its points of continuity:

$$f(x) = \frac{a_0}{2} + \sum_{n=1}^{\infty} (a_n \cos nx + b_n \sin nx) \tag{1}$$


where $a_n$ and $b_n$ are the Fourier coefficients defined by formulas (2) and (2') of Sec. 6.3. In the general case, the coefficients $a_n$ and $b_n$ of the series (1) approach zero slowly, so the series is hard to use in practice. Still less is it admissible to differentiate the series (1) term by term, yet this is required in the solution of certain problems, in particular those involving the Fourier method.

The underlying idea of Krylov's method [8] is that one takes out of the function $f(x)$ an elementary function $g(x)$ (ordinarily a piecewise polynomial function) having the same discontinuities as $f(x)$; its derivatives $g^{(i)}(x)$ $(i = 1, 2, \ldots, m)$ up to the $m$th order inclusive have the very same discontinuities as the corresponding derivatives $f^{(i)}(x)$ of the given function $f(x)$ and, what is more, $g(x)$ is such that

$$f^{(i)}(-\pi + 0) - g^{(i)}(-\pi + 0) = f^{(i)}(\pi - 0) - g^{(i)}(\pi - 0) \quad (i = 0, 1, 2, \ldots, m)$$

In that case the difference

$$\varphi(x) = f(x) - g(x)$$

will belong to periodicity class $C^{(m)}$.

Denoting the Fourier coefficients of the function $\varphi(x)$ by $\alpha_n$ and $\beta_n$ $(n = 0, 1, 2, \ldots)$, we get

$$f(x) = g(x) + \frac{\alpha_0}{2} + \sum_{n=1}^{\infty} (\alpha_n \cos nx + \beta_n \sin nx) \tag{2}$$

where $\alpha_n$ and $\beta_n$ are infinitesimals, as $n \to \infty$, of order higher than $m$ with respect to $\frac{1}{n}$; that is, the series (2) will, generally speaking, be a rapidly convergent series. This series can be differentiated term by term at least $m - 2$ times.

We give a practical demonstration of how the auxiliary function $g(x)$ is constructed from the given function $f(x)$ [9]. To do this, on the interval $[-2\pi, 2\pi]$ we construct recursively a sequence of functions $\sigma_0(x), \sigma_1(x), \ldots, \sigma_m(x)$ having the property

$$\sigma_k^{(k)}(+0) - \sigma_k^{(k)}(-0) = \pi \tag{3}$$

$(k = 0, 1, 2, \ldots, m)$ and such that the derivatives $\sigma_k^{(j)}(x)$ $(j = 0, 1, \ldots, k - 1)$ are continuous on $[-2\pi, 2\pi]$.

We define the function $\sigma_0(x)$ in the following manner:

$$\sigma_0(x) = \begin{cases} \dfrac{-\pi - x}{2} & \text{for } -2\pi < x < 0, \\[4pt] \dfrac{\pi - x}{2} & \text{for } 0 < x < 2\pi, \\[4pt] 0 & \text{for } x = -2\pi,\ 0,\ 2\pi \end{cases} \tag{4}$$





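Its graph is shown in Fig. 46. This function is odd and so its Fourier series contains only the sines of multiple arcs:

$$\sigma_0(x) = \sum_{n=1}^{\infty} b_n \sin nx$$

where

$$b_n = \frac{2}{\pi} \int_0^{\pi} \frac{\pi - x}{2} \sin nx \, dx = \frac{1}{\pi} \left( -\frac{(\pi - x) \cos nx}{n} \Big|_0^{\pi} - \frac{1}{n} \int_0^{\pi} \cos nx \, dx \right) = \frac{1}{\pi} \left( \frac{\pi}{n} - \frac{1}{n^2} \sin nx \Big|_0^{\pi} \right) = \frac{1}{n}$$

Hence

$$\sigma_0(x) = \frac{\sin x}{1} + \frac{\sin 2x}{2} + \ldots + \frac{\sin nx}{n} + \ldots \tag{5}$$

It is obvious that the function $\sigma_0(x)$ has a discontinuity at the point $x = 0$ with a jump equal to $\pi$:

$$\sigma_0(+0) - \sigma_0(-0) = \frac{\pi}{2} - \left( -\frac{\pi}{2} \right) = \pi$$

and so the function

$$\varphi(x) = \sigma_0(x - x_0) \quad (-\pi \le x \le \pi,\ -\pi < x_0 < \pi)$$

has the same jump at the point $x_0$ as the function $\sigma_0(x)$ has at the point $x = 0$:

$$\varphi(x_0 + 0) - \varphi(x_0 - 0) = \pi$$

The point of discontinuity is the only one on the interval $[-\pi, \pi]$.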
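We define the function $\sigma_1(x)$ by the formula

$$\sigma_1(x) = c_1 + \int_0^x \sigma_0(x)\, dx \tag{6}$$

where $c_1$ is a constant.

Integrating the series (5) termwise, we obtain

$$\sigma_1(x) = c_1 + \sum_{n=1}^{\infty} \int_0^x \frac{\sin nx}{n}\, dx = c_1 + \sum_{n=1}^{\infty} \frac{1}{n^2} - \sum_{n=1}^{\infty} \frac{\cos nx}{n^2} \tag{7}$$

We choose $c_1$ so that the constant term of series (7) is zero:

$$c_1 + \sum_{n=1}^{\infty} \frac{1}{n^2} = 0$$

whence

$$c_1 = -\sum_{n=1}^{\infty} \frac{1}{n^2}$$

The series $\sum_{n=1}^{\infty} \frac{1}{n^2}$ is clearly the constant term of the Fourier series of the function $\int_0^x \sigma_0(x)\, dx$. From this, using the formula (4), we have

$$\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{1}{\pi} \int_0^{\pi} \left( \int_0^x \frac{\pi - t}{2}\, dt \right) dx = \frac{1}{\pi} \int_0^{\pi} \left( \frac{\pi x}{2} - \frac{x^2}{4} \right) dx = \frac{\pi^2}{6}$$

Therefore

$$c_1 = -\frac{\pi^2}{6}$$

Hence

$$\sigma_1(x) = -\sum_{n=1}^{\infty} \frac{\cos nx}{n^2} \tag{8}$$

and

$$\sigma_1(x) = \begin{cases} \displaystyle\int_0^x \frac{\pi - x}{2}\, dx - \frac{\pi^2}{6} = -\frac{(\pi - x)^2}{4} + \frac{\pi^2}{12} & \text{for } 0 \le x \le 2\pi, \\[8pt] \displaystyle -\int_0^x \frac{\pi + x}{2}\, dx - \frac{\pi^2}{6} = -\frac{(\pi + x)^2}{4} + \frac{\pi^2}{12} & \text{for } -2\pi \le x \le 0 \end{cases}$$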

The graph of the function $\sigma_1(x)$ is shown in Fig. 47. The function $\sigma_1(x)$ is continuous on the interval $[-2\pi, 2\pi]$ but its derivative, $\sigma_1'(x) = \sigma_0(x)$, has a discontinuity at the point $x = 0$, and

$$\sigma_1'(+0) - \sigma_1'(-0) = \pi$$
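The closed form of the function σ₁ can be compared with its cosine series numerically. The Python sketch below is an illustration only: the function names are hypothetical, and the closed form used is -(π - |x|)²/4 + π²/12 on [-2π, 2π], which is what integrating (π - x)/2 with the constant -π²/6 gives.

```python
import math


def sigma1_closed(x):
    """Closed form of sigma_1 on [-2*pi, 2*pi] (an even function of x)."""
    return -((math.pi - abs(x)) ** 2) / 4 + math.pi**2 / 12


def sigma1_series(x, terms=20000):
    """Partial sum of the series -sum(cos(n x) / n^2), n = 1..terms."""
    return -sum(math.cos(n * x) / n**2 for n in range(1, terms + 1))


for x in (-1.0, 0.0, 1.0, 3.0):
    print(x, sigma1_closed(x), sigma1_series(x))
```

The two columns should agree to roughly 1/terms, since the neglected tail of the series is bounded by the sum of 1/n² over n > terms.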


The following functions are defined in the same way:

$$\sigma_2(x) = \int_0^x \sigma_1(x)\, dx + c_2,$$

$$\sigma_3(x) = \int_0^x \sigma_2(x)\, dx + c_3,$$

$$\ldots\ldots\ldots$$

$$\sigma_m(x) = \int_0^x \sigma_{m-1}(x)\, dx + c_m$$

where the arbitrary constants $c_1, c_2, \ldots, c_m$ are chosen so that the constant term of the corresponding Fourier series is zero; that is, the constants $c_k$ $(k = 1, 2, \ldots, m)$ are successively found from the conditions

$$\int_0^{2\pi} \left[ c_k + \int_0^x \sigma_{k-1}(x)\, dx \right] dx = 0$$

The functions $\sigma_k(x)$ $(k = 1, 2, \ldots, m)$ and all the derivatives up to $(k-1)$th order inclusive are continuous on the interval $[-2\pi, 2\pi]$. Here, since $\sigma_k^{(k)}(x) = \sigma_0(x)$, it follows that

$$\sigma_k^{(k)}(+0) - \sigma_k^{(k)}(-0) = \pi \quad (k = 1, 2, \ldots, m)$$

that is, the derivative of $k$th order of the function $\sigma_k(x)$ has a discontinuity at $x = 0$ with jump $\pi$. Whence it follows that the function $\psi_k(x) = \sigma_k(x - x_0)$ $(-\pi < x < \pi)$ obtained by shifting the function $\sigma_k(x)$ has a discontinuity of only the $k$th derivative at the point $x = x_0$:

$$\psi_k^{(k)}(x_0 + 0) - \psi_k^{(k)}(x_0 - 0) = \pi$$


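Now let

$x_1^{(0)}, x_2^{(0)}, \ldots, x_{k_0}^{(0)}$ be points of discontinuity of $f(x)$,
$x_1^{(1)}, x_2^{(1)}, \ldots, x_{k_1}^{(1)}$ be points of discontinuity of $f'(x)$,
. . . . . . . . . .
$x_1^{(m)}, x_2^{(m)}, \ldots, x_{k_m}^{(m)}$ be points of discontinuity of $f^{(m)}(x)$.

Note that some of these points may be repeated.

We introduce the following notations for the corresponding jumps of the function and its derivatives:

$$f^{(l)}(x_s^{(l)} + 0) - f^{(l)}(x_s^{(l)} - 0) = h_s^{(l)} \quad (l = 0, 1, \ldots, m;\ s = 1, 2, \ldots, k_l)$$

We define the function $g(x)$ (jump function) by the formula

$$g(x) = \sum_{s=1}^{k_0} \frac{h_s^{(0)}}{\pi}\, \sigma_0(x - x_s^{(0)}) + \sum_{s=1}^{k_1} \frac{h_s^{(1)}}{\pi}\, \sigma_1(x - x_s^{(1)}) + \ldots + \sum_{s=1}^{k_m} \frac{h_s^{(m)}}{\pi}\, \sigma_m(x - x_s^{(m)}) \tag{9}$$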





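The function $g(x)$ has the following properties:

(1) at the points $x_1^{(0)}, x_2^{(0)}, \ldots, x_{k_0}^{(0)}$ it has discontinuities, and the jumps at these points are equal to the jumps of the function $f(x)$ at the corresponding points:

$$g(x_s^{(0)} + 0) - g(x_s^{(0)} - 0) = \frac{h_s^{(0)}}{\pi} \left[ \sigma_0(x_s^{(0)} - x_s^{(0)} + 0) - \sigma_0(x_s^{(0)} - x_s^{(0)} - 0) \right] = \frac{h_s^{(0)}}{\pi}\, \pi = h_s^{(0)}$$

(2) the derivative $g^{(l)}(x)$ $(l = 1, 2, \ldots, m)$ is discontinuous at the points $x_1^{(l)}, x_2^{(l)}, \ldots, x_{k_l}^{(l)}$; also

$$g^{(l)}(x_s^{(l)} + 0) - g^{(l)}(x_s^{(l)} - 0) = \frac{h_s^{(l)}}{\pi} \left[ \sigma_l^{(l)}(x_s^{(l)} - x_s^{(l)} + 0) - \sigma_l^{(l)}(x_s^{(l)} - x_s^{(l)} - 0) \right] = \frac{h_s^{(l)}}{\pi}\, \pi = h_s^{(l)}$$

that is,

$$g^{(l)}(x_s^{(l)} + 0) - g^{(l)}(x_s^{(l)} - 0) = f^{(l)}(x_s^{(l)} + 0) - f^{(l)}(x_s^{(l)} - 0)$$

(3) for $x \ne x_s^{(l)}$ the function $g(x)$ has continuous derivatives of all orders.

Suppose

$$\varphi(x) = f(x) - g(x) \tag{10}$$

By virtue of the first and second properties it follows that

$$\varphi^{(l)}(x_s^{(l)} + 0) - \varphi^{(l)}(x_s^{(l)} - 0) = 0 \quad (l = 0, 1, 2, \ldots, m)$$

that is,

$$\varphi(x) \in C^{(m)}[-\pi, \pi]$$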
Thus, the rapidly convergent Fourier series (2) can be used to expand the function $f(x)$. Note that by using the expansions

$$\sigma_0(x - x_s^{(0)}) = \sum_{n=1}^{\infty} \frac{\sin n(x - x_s^{(0)})}{n},$$

$$\sigma_1(x - x_s^{(1)}) = -\sum_{n=1}^{\infty} \frac{\cos n(x - x_s^{(1)})}{n^2},$$

$$\sigma_2(x - x_s^{(2)}) = -\sum_{n=1}^{\infty} \frac{\sin n(x - x_s^{(2)})}{n^3},$$

$$\ldots\ldots\ldots$$

it is easy to write the Fourier expansion of the function $g(x)$. What we obtain finally is that the Fourier series of the function $f(x)$ consists of: (a) a slowly convergent part which is summable in elementary terms to the function $g(x)$, and (b) a rapidly convergent remainder which is the Fourier series of the function $\varphi(x) \in C^{(m)}[-\pi, \pi]$.

Note. If the limiting values of the function $f(x)$ or of its derivatives $f'(x), \ldots, f^{(k)}(x)$ $(k \le m)$ do not coincide at the endpoints of the interval $[-\pi, \pi]$, i.e.,

$$f^{(l)}(-\pi + 0) \ne f^{(l)}(\pi - 0) \quad (l = 0, 1, 2, \ldots, k)$$

then the points $x = -\pi$ and $x = \pi$ are to be regarded as points of discontinuity of $f(x)$ or, respectively, of the derivatives $f^{(l)}(x)$.

Assuming that $f(x)$ is periodically continued beyond the limits of the interval $[-\pi, \pi]$ with period $2\pi$, we find that the jump of the derivatives at the points $x = -\pi$ and $x = \pi$ is one and the same and is equal to

$$h^{(l)} = f^{(l)}(-\pi + 0) - f^{(l)}(\pi - 0)$$

By the periodicity of the function $\sigma_l(x)$ we have

$$\sigma_l(x + \pi) = \sigma_l(x - \pi)$$

On the interval $[-\pi, \pi]$, the function $\sigma_l^{(l)}(x + \pi)$ admits two discontinuity points ($x = -\pi$ and $x = \pi$) with one and the same jump $\pi$. And so we have to include only one endpoint, say $x = -\pi$, in formula (9). Indeed, by (9), the jump in the derivative $g^{(l)}(x)$ at the point $x = -\pi$ is equal to

$$g^{(l)}(-\pi + 0) - g^{(l)}(-\pi - 0) = \frac{h^{(l)}}{\pi} \left[ \sigma_l^{(l)}(+0) - \sigma_l^{(l)}(-0) \right] = h^{(l)}$$

By the periodicity of $g(x)$, this derivative has the same jump at $x = \pi$ as well. Consequently, when forming the difference

$$f(x) - g(x) = \varphi(x)$$

where only the point $x = -\pi$ is taken into account, the discontinuity of the $l$th derivative of the function $\varphi(x)$ is removed both at the point $x = -\pi$ and at $x = \pi$.


Example. Using Krylov's method, accelerate the convergence of the Fourier series of the function (Fig. 48a)

$$f(x) = \begin{cases} x^2 + 1 & \text{for } -\pi < x < 0, \\ x^2 & \text{for } 0 < x < \pi \end{cases}$$

[Fig. 48a and Fig. 48b]




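Solution. By virtue of the note, the function $f(x)$ has points of discontinuity on the interval $[-\pi, \pi]$: $x_1 = -\pi$, $x_2 = 0$, $x_3 = \pi$. Computing the Fourier coefficients, we obtain

$$a_0 = 1 + \frac{2\pi^2}{3}, \quad a_n = \frac{4}{n^2} (-1)^n, \quad b_n = \begin{cases} -\dfrac{2}{\pi n} & \text{for } n \text{ odd}, \\[4pt] 0 & \text{for } n \text{ even} \end{cases}$$

Hence, the Fourier series of the function $f(x)$ has the form

$$f(x) = \frac{1}{2} + \frac{\pi^2}{3} + 4 \sum_{n=1}^{\infty} \frac{(-1)^n}{n^2} \cos nx - \frac{2}{\pi} \sum_{k=0}^{\infty} \frac{\sin (2k+1)x}{2k+1} \tag{11}$$

The convergence of (11) is poor since the coefficients $b_n = O\left(\frac{1}{n}\right)$ decrease slowly. From the function $f(x)$ we isolate the jump function $g(x)$ so that $\varphi(x) = [f(x) - g(x)] \in C^{(0)}[-\pi, \pi]$.

Let us compute the jumps $h_j^{(0)}$ of zero kind at the points $x_j$ $(j = 1, 2, 3)$:

$$h_1^{(0)} = f(-\pi + 0) - f(\pi - 0) = (\pi^2 + 1) - \pi^2 = 1,$$
$$h_2^{(0)} = f(+0) - f(-0) = 0 - 1 = -1,$$
$$h_3^{(0)} = h_1^{(0)} = 1$$

On the basis of formula (9) and taking into account the note, we get

$$g(x) = \frac{1}{\pi}\, \sigma_0(x + \pi) - \frac{1}{\pi}\, \sigma_0(x)$$

or

$$g(x) = \frac{1}{\pi} \cdot \frac{-x}{2} - \frac{1}{\pi} \cdot \frac{-\pi - x}{2} = \frac{1}{2}$$

for $-\pi < x < 0$ and

$$g(x) = \frac{1}{\pi} \cdot \frac{-x}{2} - \frac{1}{\pi} \cdot \frac{\pi - x}{2} = -\frac{1}{2}$$

for $0 < x < \pi$.

Subtracting from $f(x)$ the jump function $g(x)$, we get the function

$$\varphi(x) = x^2 + \frac{1}{2}$$

which is continuous on the interval $[-\pi, \pi]$ (Fig. 48b). Since

$$\sigma_0(x) = \sum_{n=1}^{\infty} \frac{\sin nx}{n}$$

and

$$\sigma_0(x + \pi) = \sum_{n=1}^{\infty} \frac{\sin n(x + \pi)}{n} = \sum_{n=1}^{\infty} \frac{(-1)^n}{n} \sin nx$$

it follows that

$$g(x) = \frac{1}{\pi} \sum_{n=1}^{\infty} \frac{(-1)^n}{n} \sin nx - \frac{1}{\pi} \sum_{n=1}^{\infty} \frac{\sin nx}{n} = \frac{1}{\pi} \sum_{n=1}^{\infty} \frac{(-1)^n - 1}{n} \sin nx = -\frac{2}{\pi} \sum_{k=0}^{\infty} \frac{\sin (2k+1)x}{2k+1}$$

Hence

$$f(x) = g(x) + \frac{1}{2} + \frac{\pi^2}{3} + 4 \sum_{n=1}^{\infty} \frac{(-1)^n}{n^2} \cos nx$$

and the coefficients of the transformed Fourier series have the order of decay $O\left(\frac{1}{n^2}\right)$.

Note that if from $f(x)$ we isolate the jump function $g(x)$ to within the discontinuities of the derivative, then the remainder reduces to a constant; that is, we get the exact sum of the series (11).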
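The gain from isolating the jump function can be seen numerically. The Python sketch below is an illustration under stated assumptions: it takes f(x) = x² + 1 on (-π, 0) and f(x) = x² on (0, π), the jump function equal to 1/2 and -1/2 on those two intervals, and the series written out in the example; all function names are hypothetical.

```python
import math


def f(x):
    """The function of the example on (-pi, pi), x != 0."""
    return x * x + 1 if x < 0 else x * x


def g(x):
    """Jump function of the example: 1/2 on (-pi, 0) and -1/2 on (0, pi)."""
    return 0.5 if x < 0 else -0.5


def original_partial_sum(x, terms):
    """Partial sum of the original series: cosine terms and odd sine terms, n <= terms."""
    total = 0.5 + math.pi**2 / 3
    for n in range(1, terms + 1):
        total += 4 * (-1) ** n * math.cos(n * x) / n**2
        if n % 2 == 1:
            total -= 2 / math.pi * math.sin(n * x) / n
    return total


def accelerated_partial_sum(x, terms):
    """g(x) in closed form plus the rapidly convergent cosine remainder."""
    total = g(x) + 0.5 + math.pi**2 / 3
    for n in range(1, terms + 1):
        total += 4 * (-1) ** n * math.cos(n * x) / n**2
    return total


x, terms = 0.1, 20
print("original error:   ", abs(original_partial_sum(x, terms) - f(x)))
print("accelerated error:", abs(accelerated_partial_sum(x, terms) - f(x)))
```

Near the jump at x = 0 the slowly decaying sine terms usually dominate the error of the original partial sum, whereas the transformed series only has to sum terms of order 1/n².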


Note. The method of A. N. Krylov is also applicable to Fourier series of period $T = 2l$. Indeed, let a function $f(x)$ be given in the main domain $a - l < x < a + l$. Performing the linear transformation

$$x = a + \frac{l}{\pi}\, t$$

we obtain the function $F(t) = f\left(a + \frac{l}{\pi}\, t\right)$ of period $2\pi$ defined in the standard domain $-\pi < t < \pi$.


6.5 TRIGONOMETRIC APPROXIMATION 


Suppose we have a convergent trigonometric series

$$\sum_{n=0}^{\infty} (a_n \cos nx + b_n \sin nx) = S(x) \tag{1}$$

whose sum $S(x)$ is not known. It is required to compute this sum to a preassigned degree of accuracy.

It is obvious that the faster the coefficients $a_n$ and $b_n$ of (1) tend to zero, the smaller the number of terms we have to take to ensure the given accuracy. Therefore, it is best to accelerate the convergence of the series before computing the sum. The usual procedure is as follows: from the given series is extracted some trigonometric series, whose sum $g(x)$ is known, such that the remaining series

$$\sum_{n=0}^{\infty} (\alpha_n \cos nx + \beta_n \sin nx) \tag{2}$$

is more rapidly convergent than the original series.

If

$$g(x) = \sum_{n=0}^{\infty} (\bar{a}_n \cos nx + \bar{b}_n \sin nx)$$

then

$$S(x) = g(x) + \sum_{n=0}^{\infty} (\alpha_n \cos nx + \beta_n \sin nx) \tag{3}$$

where

$$\alpha_n = a_n - \bar{a}_n, \quad \beta_n = b_n - \bar{b}_n \quad (n = 0, 1, 2, \ldots)$$


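In the simplest cases, we can use the earlier considered expansions for constructing the function $g(x)$:

$$\sum_{n=1}^{\infty} \frac{\sin nx}{n} = \sigma_0(x) = \frac{\pi - x}{2} \quad (0 < x < 2\pi),$$

$$\sum_{n=1}^{\infty} \frac{\cos nx}{n^2} = -\sigma_1(x) = \frac{3x^2 - 6\pi x + 2\pi^2}{12} \quad (0 \le x \le 2\pi),$$

$$\sum_{n=1}^{\infty} \frac{\sin nx}{n^3} = -\sigma_2(x) = \frac{x^3 - 3\pi x^2 + 2\pi^2 x}{12} \quad (0 \le x \le 2\pi)$$

Also useful are the following expansions [7]:

$$\sum_{n=1}^{\infty} \frac{\cos nx}{n} = -\ln\left( 2 \sin \frac{x}{2} \right) \quad (0 < x < 2\pi),$$

$$\sum_{n=1}^{\infty} \frac{\sin nx}{n^2} = -\int_0^x \ln\left( 2 \sin \frac{x}{2} \right) dx \quad (0 \le x \le 2\pi),$$

$$\sum_{n=1}^{\infty} \frac{\cos nx}{n^3} = \int_0^x dx \int_0^x \ln\left( 2 \sin \frac{x}{2} \right) dx + \sum_{n=1}^{\infty} \frac{1}{n^3} \quad (0 \le x \le 2\pi)$$

where $\sum_{n=1}^{\infty} \frac{1}{n^3} = 1.202056903\ldots$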


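Example. Find the sum of the series

$$S(x) = \sum_{n=1}^{\infty} \frac{n}{n^2 + 1} \sin nx$$

to within 0.001.

Solution. The coefficients $b_n = \frac{n}{n^2 + 1}$ of the series have an order of decay $O\left(\frac{1}{n}\right)$ since $\lim_{n \to \infty} \left( b_n : \frac{1}{n} \right) = 1$. Let us accelerate the convergence of the given series. It is clear that

$$\frac{n}{n^2 + 1} = \frac{1}{n} \cdot \frac{1}{1 + \frac{1}{n^2}} = \frac{1}{n} - \frac{1}{n^3} + \gamma_n$$

where

$$\gamma_n = \frac{n}{n^2 + 1} - \frac{1}{n} + \frac{1}{n^3} = \frac{1}{n^3 (n^2 + 1)}$$

Then

$$\sum_{n=1}^{\infty} \frac{n}{n^2 + 1} \sin nx = \sum_{n=1}^{\infty} \frac{\sin nx}{n} - \sum_{n=1}^{\infty} \frac{\sin nx}{n^3} + \sum_{n=1}^{\infty} \gamma_n \sin nx$$

But

$$\sum_{n=1}^{\infty} \frac{\sin nx}{n} = \sigma_0(x) \quad \text{and} \quad \sum_{n=1}^{\infty} \frac{\sin nx}{n^3} = -\sigma_2(x)$$

Thus

$$S(x) = \sigma_0(x) + \sigma_2(x) + \sum_{n=1}^{\infty} \gamma_n \sin nx$$

where $\gamma_n = \frac{1}{n^3 (n^2 + 1)}$.

Let $N$ be the number of terms of the series $\sum_{n=1}^{\infty} \gamma_n \sin nx$ which must be taken so that the remainder $R_N$ satisfies the inequality

$$|R_N| = \left| \sum_{n=N+1}^{\infty} \gamma_n \sin nx \right| < 0.001$$

Let us find the number $N$. We have

$$\left| \sum_{n=N+1}^{\infty} \frac{1}{n^3 (n^2 + 1)} \sin nx \right| < \sum_{n=N+1}^{\infty} \frac{1}{n^5} < \int_N^{\infty} \frac{dx}{x^5} = \frac{1}{4N^4}$$

Solving the inequality $\frac{1}{4N^4} < 0.001$ we find that $N = 5$ suffices.

Hence, to the given accuracy we have

$$S(x) \approx \frac{\pi - x}{2} - \frac{2\pi^2 x - 3\pi x^2 + x^3}{12} + \sum_{n=1}^{5} \frac{\sin nx}{n^3 (n^2 + 1)} \quad (0 < x < 2\pi)$$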
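The accelerated formula can be compared with brute-force summation. The Python sketch below assumes the series of the example is the sum of n/(n² + 1) · sin nx and uses the closed forms of σ₀ and σ₂ quoted earlier in this section; the function names are hypothetical.

```python
import math


def accelerated_sum(x, terms=5):
    """sigma_0(x) + sigma_2(x) in closed form plus `terms` terms of the fast remainder."""
    closed = (math.pi - x) / 2 - (2 * math.pi**2 * x - 3 * math.pi * x**2 + x**3) / 12
    remainder = sum(math.sin(n * x) / (n**3 * (n * n + 1)) for n in range(1, terms + 1))
    return closed + remainder


def direct_sum(x, terms):
    """Plain partial sum of sum(n / (n^2 + 1) * sin(n x)), n = 1..terms."""
    return sum(n / (n * n + 1) * math.sin(n * x) for n in range(1, terms + 1))


x = 1.0
print(accelerated_sum(x))      # five remainder terms
print(direct_sum(x, 200000))   # slowly convergent original series
```

Five terms of the remainder already match a partial sum of the original series with hundreds of thousands of terms to within the required 0.001.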


REFERENCES FOR CHAPTER 6 


[1] G. M. Fikhtengolts, Principles of Mathematical Analysis, 1956, Vol. II, Chapters XV and XXIV (in Russian).

[2] A. Markov, Calculus of Finite Differences, 1911, Chapter II (in Russian).

[3] G. Salekhov, Calculation of Series, 1955, Chapters I and II (in Russian).

[4] Ya. S. Bezikovich, Calculus of Finite Differences, 1939, Chapter IX (in Russian).

[5] A. O. Gelfond, Calculus of Finite Differences, 1952, Chapter IV (in Russian).

[6] C. J. de la Vallée-Poussin, Cours d'Analyse Infinitésimale, Vol. II, 1921.

[7] G. P. Tolstov, Fourier Series, 1951, Chapters I-V (in Russian).

[8] A. N. Krylov, Lectures on Approximate Computations, 1954, Chapter V (in Russian).

[9] L. V. Kantorovich, V. I. Krylov, Approximate Methods of Higher Analysis, 1949, Chapter I (in Russian).


Chapter 7 
MATRIX ALGEBRA 


7.1 BASIC DEFINITIONS 


A set of $mn$ numbers (real or complex) arranged in a rectangular array of $m$ rows and $n$ columns

$$A = \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \ldots & \ldots & \ldots & \ldots \\ a_{m1} & a_{m2} & \ldots & a_{mn} \end{bmatrix} \tag{1}$$

is called a matrix (of numbers). The rows and columns of (1) are termed the lines of the matrix.

The numbers $a_{ij}$ $(i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, n)$ that comprise the given matrix are called the elements (or entries) of the matrix. Here, the first subscript $i$ denotes the number of the row of the element and the second subscript $j$ denotes the number of its column.

A matrix, say (1), is often more compactly written as

$$A = [a_{ij}] \quad (i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, n)$$

or

$$A = [a_{ij}]_{m,n}$$

We say that the matrix $A$ has dimensions $m \times n$, or that $A$ is an $m$-by-$n$ matrix (or an $m \times n$ matrix, or is of type $m \times n$).

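If $m = n$, the matrix is called a square matrix of order $n$. If $m \ne n$, then it is a rectangular matrix. A $1 \times n$ matrix is called a row vector and an $m \times 1$ matrix, a column vector. An ordinary number (scalar) may be regarded as a $1 \times 1$ matrix. A square matrix of the form

$$\begin{bmatrix} \alpha_1 & 0 & \ldots & 0 \\ 0 & \alpha_2 & \ldots & 0 \\ \ldots & \ldots & \ldots & \ldots \\ 0 & 0 & \ldots & \alpha_n \end{bmatrix} \tag{2}$$

is termed diagonal and is briefly denoted as $[\alpha_1, \alpha_2, \ldots, \alpha_n]$.

If $\alpha_i = 1$ $(i = 1, 2, \ldots, n)$, the matrix (2) is called a unit matrix and is denoted by the letter $E$ (or $I$); thus,

$$E = \begin{bmatrix} 1 & 0 & \ldots & 0 \\ 0 & 1 & \ldots & 0 \\ \ldots & \ldots & \ldots & \ldots \\ 0 & 0 & \ldots & 1 \end{bmatrix}$$

Introducing a symbol called the Kronecker delta,

$$\delta_{ij} = \begin{cases} 0 & \text{if } i \ne j, \\ 1 & \text{if } i = j \end{cases}$$

we can write

$$E = [\delta_{ij}]$$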

A matrix with all elements zero is called a zero matrix and is denoted by $0$. To indicate the number of rows and columns of a zero matrix, one writes $0_{mn}$.

With a square matrix $A = [a_{ij}]_{n,n}$ is associated a so-called determinant:

$$\det A = \begin{vmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \ldots & \ldots & \ldots & \ldots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{vmatrix}$$

These are two distinct concepts: a matrix is an ordered set of numbers written in the form of a rectangular array; its determinant, $\det A$, is a number that is determined by means of specific rules, namely:

$$\det A = \sum_{(\alpha_1, \alpha_2, \ldots, \alpha_n)} (-1)^{\kappa}\, a_{1\alpha_1} a_{2\alpha_2} \ldots a_{n\alpha_n} \tag{3}$$

where the summation (3) is taken over all permutations $(\alpha_1, \alpha_2, \ldots, \alpha_n)$ of the elements $1, 2, \ldots, n$, and, consequently, contains $n!$ summands; $\kappa = 0$ if the permutation is even and $\kappa = 1$ if the permutation is odd.
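The definition of the determinant as a signed sum over permutations can be sketched directly in Python. The function names below are hypothetical, and the parity is computed by counting inversions, which is one standard way of telling even permutations from odd ones. The cost grows as n!, so this is useful only as an illustration for small matrices.

```python
from itertools import permutations


def parity(perm):
    """Return 0 for an even permutation and 1 for an odd one (by counting inversions)."""
    inversions = sum(
        1
        for i in range(len(perm))
        for j in range(i + 1, len(perm))
        if perm[i] > perm[j]
    )
    return inversions % 2


def det(matrix):
    """Determinant as the signed sum over all n! permutations of the column indices."""
    n = len(matrix)
    total = 0
    for perm in permutations(range(n)):
        term = (-1) ** parity(perm)
        for row, col in enumerate(perm):
            term *= matrix[row][col]
        total += term
    return total


print(det([[1, 2], [3, 4]]))                      # -2
print(det([[1, 2, 3], [-2, -4, -5], [3, 5, 6]]))  # 1
```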


7.2 OPERATIONS INVOLVING MATRICES 


A, EQUALITY OF MATRICES 


Two matrices $A = [a_{ij}]$ and $B = [b_{ij}]$ are considered equal, $A = B$, if they have the same dimensions, which is to say, if they have the same number of rows and columns, and the corresponding elements are equal, thus,

$$a_{ij} = b_{ij}$$


B. THE SUM AND DIFFERENCE OF MATRICES 


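The sum of two matrices $A = [a_{ij}]$ and $B = [b_{ij}]$ of the same dimensions is a matrix $C = [c_{ij}]$ of the same dimensions with elements $c_{ij}$ equal to the sums of the corresponding elements $a_{ij}$ and $b_{ij}$ of the matrices $A$ and $B$; that is, $c_{ij} = a_{ij} + b_{ij}$. Thus,

$$A + B = \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} & \ldots & a_{1n} + b_{1n} \\ a_{21} + b_{21} & a_{22} + b_{22} & \ldots & a_{2n} + b_{2n} \\ \ldots & \ldots & \ldots & \ldots \\ a_{m1} + b_{m1} & a_{m2} + b_{m2} & \ldots & a_{mn} + b_{mn} \end{bmatrix}$$

The following properties are derived directly from the definition of a matrix sum:

(1) $A + (B + C) = (A + B) + C$,
(2) $A + B = B + A$,
(3) $A + 0 = A$.

The difference of two matrices is defined analogously:

$$A - B = \begin{bmatrix} a_{11} - b_{11} & a_{12} - b_{12} & \ldots & a_{1n} - b_{1n} \\ a_{21} - b_{21} & a_{22} - b_{22} & \ldots & a_{2n} - b_{2n} \\ \ldots & \ldots & \ldots & \ldots \\ a_{m1} - b_{m1} & a_{m2} - b_{m2} & \ldots & a_{mn} - b_{mn} \end{bmatrix}$$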


C. MULTIPLICATION OF A MATRIX BY A SCALAR 


The product of a matrix $A = [a_{ij}]$ by a scalar $\alpha$ (or the product of a scalar by a matrix $A$) is a matrix whose elements are obtained by multiplying all the elements of $A$ by the scalar $\alpha$; that is,

$$\alpha A = A \alpha = \begin{bmatrix} \alpha a_{11} & \alpha a_{12} & \ldots & \alpha a_{1n} \\ \alpha a_{21} & \alpha a_{22} & \ldots & \alpha a_{2n} \\ \ldots & \ldots & \ldots & \ldots \\ \alpha a_{m1} & \alpha a_{m2} & \ldots & \alpha a_{mn} \end{bmatrix}$$

From the definition of the product of a scalar by a matrix follow directly the properties:

(1) $1A = A$,
(2) $0A = 0$,
(3) $\alpha(\beta A) = (\alpha\beta) A$,
(4) $(\alpha + \beta) A = \alpha A + \beta A$,
(5) $\alpha (A + B) = \alpha A + \alpha B$.

Here, $A$ and $B$ are matrices, and $\alpha$ and $\beta$ are scalars.

Note that if matrix $A$ is square of order $n$, then

$$\det \alpha A = \alpha^n \det A$$

The matrix

$$-A = (-1) A$$

is called the negative (additive inverse) of $A$. It is easy to see that if $A$ and $B$ have the same dimensions, then

$$A - B = A + (-B)$$


D. MULTIPLICATION OF MATRICES

Suppose

$$A = \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \ldots & \ldots & \ldots & \ldots \\ a_{m1} & a_{m2} & \ldots & a_{mn} \end{bmatrix}$$

and

$$B = \begin{bmatrix} b_{11} & b_{12} & \ldots & b_{1q} \\ b_{21} & b_{22} & \ldots & b_{2q} \\ \ldots & \ldots & \ldots & \ldots \\ b_{p1} & b_{p2} & \ldots & b_{pq} \end{bmatrix}$$

are matrices of dimensions $m \times n$ and $p \times q$, respectively. If the number of columns of $A$ is equal to the number of rows of $B$, that is,

$$n = p \tag{1}$$

then for these matrices is defined a product, matrix $C$ of dimensions $m \times q$:

$$C = AB = \begin{bmatrix} c_{11} & c_{12} & \ldots & c_{1q} \\ c_{21} & c_{22} & \ldots & c_{2q} \\ \ldots & \ldots & \ldots & \ldots \\ c_{m1} & c_{m2} & \ldots & c_{mq} \end{bmatrix}$$

where

$$c_{ij} = a_{i1} b_{1j} + a_{i2} b_{2j} + \ldots + a_{in} b_{nj} \quad (i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, q)$$

From the definition follows the rule for multiplication of matrices: to obtain the element in the ith row and jth column of the product of two matrices, multiply the elements of the ith row of the first matrix by the corresponding elements of the jth column of the second and add the products.

The product $AB$ is meaningful if and only if the matrix $A$ contains as many elements in the rows as there are elements in the columns of matrix $B$. In particular, two square matrices can be multiplied only if they are of the same order.


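Example 1.

$$A = \begin{bmatrix} 3 & 2 & 8 & 1 \\ 1 & -4 & 0 & 3 \end{bmatrix}, \quad B = \begin{bmatrix} 2 & -1 \\ 1 & -3 \\ 0 & 1 \\ 3 & 1 \end{bmatrix}$$

$$AB = \begin{bmatrix} 3 \cdot 2 + 2 \cdot 1 + 8 \cdot 0 + 1 \cdot 3 & 3 \cdot (-1) + 2 \cdot (-3) + 8 \cdot 1 + 1 \cdot 1 \\ 1 \cdot 2 + (-4) \cdot 1 + 0 \cdot 0 + 3 \cdot 3 & 1 \cdot (-1) + (-4) \cdot (-3) + 0 \cdot 1 + 3 \cdot 1 \end{bmatrix} = \begin{bmatrix} 11 & 0 \\ 7 & 14 \end{bmatrix}$$

Example 2.

$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} = \begin{bmatrix} 1 \cdot 1 + 2 \cdot 2 + 3 \cdot 3 \\ 4 \cdot 1 + 5 \cdot 2 + 6 \cdot 3 \\ 7 \cdot 1 + 8 \cdot 2 + 9 \cdot 3 \end{bmatrix} = \begin{bmatrix} 14 \\ 32 \\ 50 \end{bmatrix}$$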
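The row-by-column rule translates almost literally into code. The Python sketch below represents a matrix as a list of rows; the function name `matmul` is a hypothetical helper, and the two calls reproduce Examples 1 and 2 under the reading of the matrices given above.

```python
def matmul(a, b):
    """Product of an m x n matrix by an n x q matrix, both given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("columns of the first factor must equal rows of the second")
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# Example 1: a 2 x 4 matrix times a 4 x 2 matrix gives a 2 x 2 matrix.
print(matmul([[3, 2, 8, 1], [1, -4, 0, 3]], [[2, -1], [1, -3], [0, 1], [3, 1]]))

# Example 2: a 3 x 3 matrix times a column vector gives a column vector.
print(matmul([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [[1], [2], [3]]))
```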


A matrix product has the following properties:

(1) $A(BC) = (AB)C$,
(2) $\alpha(AB) = (\alpha A)B$,
(3) $(A + B)C = AC + BC$,
(4) $C(A + B) = CA + CB$

where $A$, $B$ and $C$ are matrices and $\alpha$ is a scalar.

Equations (1) to (4) are to be understood in the sense that if one of their members exists, then the other member also exists, and they are equal.

The product of two matrices is generally noncommutative, $AB \ne BA$, as witness the examples.

Example 3. Let

$$A = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \quad B = \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix}$$

Then

$$AB = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}, \quad BA = \begin{bmatrix} 23 & 34 \\ 31 & 46 \end{bmatrix}$$

which is to say, $AB \ne BA$.


What is more, it may happen that the product of two matrices in a given order is meaningful while the product of the same matrices in the reverse order is meaningless.

For example, if 


3 2 1 
fell 27 Bele 
4 3 0 


19 13 | 


then 


A == 
2 P 31 19 


but BA does not exist. 

When $AB = BA$, the matrices $A$ and $B$ are said to be commutative. Thus, for example, as is readily seen, the unit matrix $E$ is commutative with any square matrix $A$ of the same order, and

$$AE = EA = A$$

Thus the unit matrix $E$ plays the role of identity (unity) in multiplication.

If $A$ and $B$ are square matrices of the same order, then

$$\det (AB) = \det (BA) = \det A \cdot \det B$$

This formula follows from the rule for multiplying determinants. To illustrate, for the matrices given in Example 3 we have

$$\begin{vmatrix} 19 & 22 \\ 43 & 50 \end{vmatrix} = \begin{vmatrix} 1 & 2 \\ 3 & 4 \end{vmatrix} \cdot \begin{vmatrix} 5 & 6 \\ 7 & 8 \end{vmatrix}$$

and

$$\begin{vmatrix} 23 & 34 \\ 31 & 46 \end{vmatrix} = \begin{vmatrix} 1 & 2 \\ 3 & 4 \end{vmatrix} \cdot \begin{vmatrix} 5 & 6 \\ 7 & 8 \end{vmatrix}$$

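Both facts, that AB and BA generally differ while their determinants coincide, can be checked for the matrices of Example 3 with a few lines of Python. The helpers `matmul` and `det2` are hypothetical names introduced for this sketch.

```python
def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
AB, BA = matmul(A, B), matmul(B, A)
print(AB, BA)                                  # the two products differ
print(det2(AB), det2(BA), det2(A) * det2(B))   # the three determinants agree
```
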
7.3 THE TRANSPOSE OF A MATRIX 


If in an $m \times n$ matrix

$$A = \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \ldots & \ldots & \ldots & \ldots \\ a_{m1} & a_{m2} & \ldots & a_{mn} \end{bmatrix}$$

we replace the rows by the columns, we get what is called the transpose of $A$:

$$A' = A^T = \begin{bmatrix} a_{11} & a_{21} & \ldots & a_{m1} \\ a_{12} & a_{22} & \ldots & a_{m2} \\ \ldots & \ldots & \ldots & \ldots \\ a_{1n} & a_{2n} & \ldots & a_{mn} \end{bmatrix}$$

of dimensions $n \times m$. In particular, the transpose of the row vector

$$a = [a_1 \; a_2 \; \ldots \; a_n]$$

is the column vector

$$a' = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}$$

The transpose of a matrix has the following properties:

(1) the transpose of a transpose is the original matrix:

$$A'' = (A')' = A$$

(2) the transpose of a sum is equal to the sum of the transposed matrices of the summands, i.e.

$$(A + B)' = A' + B'$$

(3) the transpose of a product is equal to the product of the transposes of the factors taken in reverse order:

$$(AB)' = B'A'$$


Indeed, the element of the $i$th row and $j$th column of the matrix $(AB)'$ is equal to the element of the $j$th row and $i$th column of the matrix $AB$:

$$a_{j1} b_{1i} + a_{j2} b_{2i} + \ldots + a_{jn} b_{ni}$$

This expression is obviously the sum of the products of the elements of the $i$th row of matrix $B'$ by the corresponding elements of the $j$th column of matrix $A'$; that is to say, it is equal to the corresponding element of the matrix $B'A'$. If $A$ is square, then, clearly,

$$\det A' = \det A$$

The matrix $A = [a_{ij}]$ is called a symmetric matrix if it coincides with its transpose, that is, if

$$A' = A \tag{1}$$

From equation (1) it follows that: (1) a symmetric matrix is square ($m = n$) and (2) the elements symmetric about the principal diagonal are equal, or

$$a_{ij} = a_{ji}$$

The product

$$C = AA'$$

is obviously a symmetric matrix, since

$$C' = (AA')' = (A')' A' = AA' = C$$

For example,

$$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix} = \begin{bmatrix} 1^2 + 2^2 + 3^2 & 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 \\ 4 \cdot 1 + 5 \cdot 2 + 6 \cdot 3 & 4^2 + 5^2 + 6^2 \end{bmatrix} = \begin{bmatrix} 14 & 32 \\ 32 & 77 \end{bmatrix}$$

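The transpose rules are easy to try out. In the Python sketch below the helper names `transpose` and `matmul` are hypothetical, and the second matrix B is an arbitrary 3 x 2 matrix chosen only to exercise the rule for the transpose of a product.

```python
def transpose(a):
    """Swap rows and columns of a matrix given as a list of rows."""
    return [list(column) for column in zip(*a)]


def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


A = [[1, 2, 3], [4, 5, 6]]
B = [[1, 0], [2, 1], [0, 3]]   # illustrative matrix, not from the text

C = matmul(A, transpose(A))
print(C, transpose(C) == C)    # A A' is symmetric
print(transpose(matmul(A, B)) == matmul(transpose(B), transpose(A)))  # (AB)' = B'A'
```
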
7.4 THE INVERSE MATRIX 


Definition 1. The inverse of a given matrix is a matrix which, when multiplied on the right (postmultiplied) or on the left (premultiplied) by the given matrix, yields the unit matrix.

We denote the inverse of matrix $A$ by $A^{-1}$. Then, by definition, we have

$$AA^{-1} = A^{-1}A = E \tag{1}$$

where $E$ is the unit matrix.

Finding the inverse of a given matrix is called matrix inversion.

Definition 2. A square matrix is termed nonsingular if the determinant is different from zero, otherwise it is called a singular matrix.

Theorem. Every nonsingular matrix has an inverse.
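Proof. Suppose we have a nonsingular matrix of order $n$:

$$A = \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \ldots & \ldots & \ldots & \ldots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{bmatrix}$$

where $\det A = \Delta \ne 0$.

We form the so-called adjoint of matrix $A$:

$$\tilde{A} = \begin{bmatrix} A_{11} & A_{21} & \ldots & A_{n1} \\ A_{12} & A_{22} & \ldots & A_{n2} \\ \ldots & \ldots & \ldots & \ldots \\ A_{1n} & A_{2n} & \ldots & A_{nn} \end{bmatrix} \tag{2}$$

where $A_{ij}$ are the cofactors (signed minors) of the corresponding elements $a_{ij}$ $(i, j = 1, 2, \ldots, n)$.

Note that the cofactors of the elements of the rows lie in the corresponding columns, that is, the transpose operation is performed.

Divide all the elements of the last matrix by the value of the determinant of $A$ (by $\Delta$, that is):

$$A^* = \begin{bmatrix} \dfrac{A_{11}}{\Delta} & \dfrac{A_{21}}{\Delta} & \ldots & \dfrac{A_{n1}}{\Delta} \\[6pt] \dfrac{A_{12}}{\Delta} & \dfrac{A_{22}}{\Delta} & \ldots & \dfrac{A_{n2}}{\Delta} \\[6pt] \ldots & \ldots & \ldots & \ldots \\[2pt] \dfrac{A_{1n}}{\Delta} & \dfrac{A_{2n}}{\Delta} & \ldots & \dfrac{A_{nn}}{\Delta} \end{bmatrix} \tag{3}$$

We will prove that the matrix $A^*$ is the required inverse: $A^* = A^{-1}$.

As we know, (1) the sum of the products of the elements of a certain line (row or column) of the determinant by the cofactors of the elements is equal to the determinant, and (2) the sum of the products of the elements of a line of the determinant by the cofactors of the corresponding elements of a parallel line (row or column) is equal to zero; thus

$$\sum_{k=1}^{n} a_{ik} A_{jk} = \delta_{ij} \Delta \tag{4}$$

and

$$\sum_{k=1}^{n} a_{ki} A_{kj} = \delta_{ij} \Delta \tag{4'}$$

where

$$\delta_{ij} = \begin{cases} 1 & \text{for } i = j, \\ 0 & \text{for } i \ne j \end{cases}$$

Using these properties, form the product $AA^*$ to get

$$AA^* = \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \ldots & \ldots & \ldots & \ldots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{bmatrix} \begin{bmatrix} \dfrac{A_{11}}{\Delta} & \dfrac{A_{21}}{\Delta} & \ldots & \dfrac{A_{n1}}{\Delta} \\[6pt] \dfrac{A_{12}}{\Delta} & \dfrac{A_{22}}{\Delta} & \ldots & \dfrac{A_{n2}}{\Delta} \\[6pt] \ldots & \ldots & \ldots & \ldots \\[2pt] \dfrac{A_{1n}}{\Delta} & \dfrac{A_{2n}}{\Delta} & \ldots & \dfrac{A_{nn}}{\Delta} \end{bmatrix} = \begin{bmatrix} 1 & 0 & \ldots & 0 \\ 0 & 1 & \ldots & 0 \\ \ldots & \ldots & \ldots & \ldots \\ 0 & 0 & \ldots & 1 \end{bmatrix} \tag{5}$$

Thus, $AA^* = E$.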


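Formula (5) can be derived faster if we use the compact notation

$$A = [a_{ij}] \quad \text{and} \quad A^* = \left[ \frac{A_{ji}}{\Delta} \right]$$

Taking relation (4) into account, we obtain

$$AA^* = \left[ \sum_{k=1}^{n} a_{ik} \frac{A_{jk}}{\Delta} \right] = [\delta_{ij}] = E$$

Similarly, we see that $A^*A = E$.

Hence, $A^* = A^{-1}$, or

$$A^{-1} = \frac{1}{\Delta}\, \tilde{A}$$

where

$$\Delta = \det A$$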

Note 1. The inverse $A^{-1}$ of a given matrix $A$ is unique. What is more, every right inverse (left inverse) of $A$ coincides with its inverse $A^{-1}$ (if such exists).

Indeed, if

$$AB = E$$

then, premultiplying this equation by $A^{-1}$, we get

$$A^{-1}AB = A^{-1}E$$

or

$$B = A^{-1}$$

We similarly prove that if

$$CA = E$$

then $C = A^{-1}$.

Therefore, when verifying relation (1) one equation is sufficient.

Note 2. A singular square matrix does not have an inverse. True enough: since the matrix $A$ is singular,

$$\det A = 0$$

From (1) we have

$$\det A \cdot \det A^{-1} = \det E = 1$$

or

$$0 = 1 \; (?!)$$

which is impossible. The assertion is proved.
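Example. Find the inverse of the matrix

$$A = \begin{bmatrix} 1 & 2 & 3 \\ -2 & -4 & -5 \\ 3 & 5 & 6 \end{bmatrix}$$

Solution. Since the determinant

$$\Delta = \begin{vmatrix} 1 & 2 & 3 \\ -2 & -4 & -5 \\ 3 & 5 & 6 \end{vmatrix} = \begin{vmatrix} 1 & 2 & 3 \\ 0 & 0 & 1 \\ 0 & -1 & -3 \end{vmatrix} = 1 \ne 0$$

the matrix $A$ is nonsingular.

Form the adjoint matrix

$$\tilde{A} = \begin{bmatrix} 1 & 3 & 2 \\ -3 & -3 & -1 \\ 2 & 1 & 0 \end{bmatrix}$$

Divide all elements of $\tilde{A}$ by $\Delta = 1$ to get

$$A^{-1} = \begin{bmatrix} 1 & 3 & 2 \\ -3 & -3 & -1 \\ 2 & 1 & 0 \end{bmatrix}$$

The reader is advised to verify that we indeed have

$$AA^{-1} = A^{-1}A = E$$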
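The construction of the inverse from cofactors can be sketched in Python with exact rational arithmetic. The function names `minor`, `det` and `inverse` are hypothetical; the determinant is computed by cofactor expansion along the first row, which is adequate for small matrices only.

```python
from fractions import Fraction


def minor(m, i, j):
    """Matrix m with row i and column j deleted."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(m) if k != i]


def det(m):
    """Determinant by cofactor expansion along the first row (empty matrix gives 1)."""
    if not m:
        return 1
    return sum((-1) ** j * m[0][j] * det(minor(m, 0, j)) for j in range(len(m)))


def inverse(m):
    """Inverse as the adjoint (transposed cofactor matrix) divided by the determinant."""
    delta = det(m)
    if delta == 0:
        raise ValueError("a singular matrix has no inverse")
    n = len(m)
    # Element (i, j) of the inverse is the cofactor of a_ji divided by delta.
    return [
        [Fraction((-1) ** (i + j) * det(minor(m, j, i)), delta) for j in range(n)]
        for i in range(n)
    ]


A = [[1, 2, 3], [-2, -4, -5], [3, 5, 6]]
print(det(A))
print([[int(x) for x in row] for row in inverse(A)])
```

Multiplying A by the result in either order should give the unit matrix, which is the verification the example asks for.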


Below are some of the basic properties of an inverse matrix.

1. The determinant of an inverse matrix is equal to the reciprocal of the determinant of the original matrix. Suppose

$$A^{-1}A = E$$

Taking into consideration that the determinant of a product of two square matrices is equal to the product of the determinants of the matrices, we get

$$\det A^{-1} \cdot \det A = \det E = 1$$

Hence

$$\det A^{-1} = \frac{1}{\det A}$$

2. The inverse of a product of square matrices is equal to the product of the inverses of the factors taken in reverse order:

$$(AB)^{-1} = B^{-1}A^{-1}$$

Indeed,

$$AB(B^{-1}A^{-1}) = A(BB^{-1})A^{-1} = AEA^{-1} = AA^{-1} = E$$

and

$$(B^{-1}A^{-1})AB = B^{-1}(A^{-1}A)B = B^{-1}EB = B^{-1}B = E$$

Hence $B^{-1}A^{-1}$ is the inverse of $AB$.

In the more general case,

$$(A_1 A_2 \ldots A_p)^{-1} = A_p^{-1} A_{p-1}^{-1} \ldots A_1^{-1}$$

3. The transpose of an inverse is equal to the inverse of the transpose of the given matrix:

$$(A^{-1})' = (A')^{-1}$$

Taking transposes of $A^{-1}A = E$, we get

$$(A^{-1}A)' = A'(A^{-1})' = E' = E$$

Whence, premultiplying the last equation by the matrix $(A')^{-1}$, we obtain

$$(A')^{-1}A'(A^{-1})' = (A')^{-1}E$$

or

$$(A^{-1})' = (A')^{-1}$$

which is what we set out to prove.

Note. The matrix equations

$$AX = B \quad \text{and} \quad YA = B$$

are easily solved by means of an inverse matrix. If $\det A \ne 0$, then

$$X = A^{-1}B \quad \text{and} \quad Y = BA^{-1}$$


7.5 POWERS OF A MATRIX 


Let A be a square matrix. If p is a natural number, then put 
AA...A= A? 
———— 


p times 
We also agree that A°=E£, where E is the unit matrix. If mat- 
rix A is nonsingular, we can introduce a negative power and de- 
fine it by the relation 
A~P=(A~!)jP 
The ordinary rules hold for powers of matrices with integral 
exponents: l 
(1) APAT = AP+s, 
(2) -(A?)? = Ara. 


It is obviously impossible to raise a nonsquare matrix to a 
power. 


7.6 Rational functions of a matrix 241 


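Example 1. Let
$$A = \begin{bmatrix} \alpha_1 & 0 & \ldots & 0 \\ 0 & \alpha_2 & \ldots & 0 \\ \ldots & \ldots & \ldots & \ldots \\ 0 & 0 & \ldots & \alpha_n \end{bmatrix}$$
Then
$$A^p = \begin{bmatrix} \alpha_1^p & 0 & \ldots & 0 \\ 0 & \alpha_2^p & \ldots & 0 \\ \ldots & \ldots & \ldots & \ldots \\ 0 & 0 & \ldots & \alpha_n^p \end{bmatrix}$$
Example 2. Find
$$\begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}^2$$
Solution. We have
$$\begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}^2 = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}$$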
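
Powers with natural exponents are easy to compute mechanically. The
sketch below is an illustration only, not the book's procedure: the
function name mat_pow and the use of repeated squaring are choices
made here. It reproduces the square found in Example 2 and checks
the rule for adding exponents.

```python
def identity(n):
    """Unit matrix E of order n."""
    return [[int(i == j) for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Ordinary product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def mat_pow(a, p):
    """A**p for a square matrix and an integer p >= 0 (A**0 = E)."""
    if any(len(row) != len(a) for row in a):
        raise ValueError("only a square matrix can be raised to a power")
    if p < 0:
        raise ValueError("negative powers need the inverse matrix")
    result = identity(len(a))
    base = [row[:] for row in a]
    while p > 0:
        if p & 1:                  # this binary digit of p is set
            result = matmul(result, base)
        base = matmul(base, base)  # square for the next binary digit
        p >>= 1
    return result

# The matrix of Example 2: ones on the first superdiagonal.
N = [[0, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0]]
N_squared = mat_pow(N, 2)
```

For this particular matrix each multiplication shifts the line of
ones one step further from the main diagonal, so its fourth power
is the zero matrix.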


If A and B are square matrices of one and the same order, and
$AB = BA$, then the binomial formula holds true:
$$(A + B)^p = \sum_{k=0}^{p} C_p^k A^k B^{p-k}$$


7.6 RATIONAL FUNCTIONS OF A MATRIX 


Suppose
$$X = \begin{bmatrix} x_{11} & x_{12} & \ldots & x_{1n} \\ x_{21} & x_{22} & \ldots & x_{2n} \\ \ldots & \ldots & \ldots & \ldots \\ x_{n1} & x_{n2} & \ldots & x_{nn} \end{bmatrix}$$
is an arbitrary square matrix of order n. By analogy with the
formulas of elementary algebra we determine the integral rational
functions of the matrix X:
$$P(X) = A_0X^m + A_1X^{m-1} + \ldots + A_mE \quad \text{(right polynomial)}$$
$$\tilde{P}(X) = X^mA_0 + X^{m-1}A_1 + \ldots + EA_m \quad \text{(left polynomial)}$$
where $A_\nu$ ($\nu = 0, 1, \ldots, m$) are m x n or, respectively,
n x m matrices and E is a unit matrix of order n.

Generally speaking, $P(X) \neq \tilde{P}(X)$.

It is also possible to introduce fractional rational functions of
the matrix X, defining them by the formulas
$$R_1(X) = P(X)[Q(X)]^{-1}$$
and
$$R_2(X) = [Q(X)]^{-1}P(X)$$
where P(X) and Q(X) are matrix polynomials and $\det[Q(X)] \neq 0$.


Example. Let 
paaa ga 
ASAE A 


0 0 
where X is a variable matrix of order two. Find P eh a 


Solution We have 
0 0 0 0}? fi Are o0 0 1l 
Piip aI arol ce 
0 0 —I 0 0 1 =e 
= + = = 

0-0 1 0 1 0 0 0 

7.7 THE ABSOLUTE VALUE AND NORM OF A MATRIX 

The inequality
$$A \le B \qquad (1)$$
between two matrices $A = [a_{ij}]$ and $B = [b_{ij}]$ of the same
size means that
$$a_{ij} \le b_{ij} \qquad (2)$$
In this sense, not every pair of matrices is comparable.
We use the term absolute value (modulus) of a matrix $A = [a_{ij}]$
to mean the matrix
$$|A| = [|a_{ij}|]$$
where $|a_{ij}|$ are the moduli of the elements of A.

If A and B are matrices for which the operations A + B and AB
are meaningful, then

(a) $|A + B| \le |A| + |B|$,

(b) $|AB| \le |A| \cdot |B|$,

(c) $|\alpha A| = |\alpha| \, |A|$

where $\alpha$ is a scalar.

In particular, we get
$$|A^p| \le |A|^p$$
where p is a natural number.

By the norm of a matrix $A = [a_{ij}]$ we mean a real number $\|A\|$
that satisfies the following conditions:

(a) $\|A\| \ge 0$ and $\|A\| = 0$ if and only if $A = 0$,

(b) $\|\alpha A\| = |\alpha| \, \|A\|$ ($\alpha$ a scalar) and, in
particular, $\|-A\| = \|A\|$,

(c) $\|A + B\| \le \|A\| + \|B\|$,

(d) $\|AB\| \le \|A\| \, \|B\|$

(A and B are matrices for which the corresponding operations are
meaningful). As a particular instance, for the square matrix we
have
$$\|A^p\| \le \|A\|^p$$
where p is a natural number.

We note one more important inequality between the norms of
matrices A and B of the same size. Using Condition (c), we have
$$\|B\| = \|A + (B - A)\| \le \|A\| + \|B - A\|$$
whence
$$\|A - B\| = \|B - A\| \ge \|B\| - \|A\|$$
Similarly
$$\|A - B\| \ge \|A\| - \|B\|$$
hence
$$\|A - B\| \ge \big| \, \|B\| - \|A\| \, \big|$$

We call the norm canonical if the following two conditions hold
true as well:

(e) if $A = [a_{ij}]$, then
$$|a_{ij}| \le \|A\|$$
and for the scalar matrix $A = [a_{11}]$ we have $\|A\| = |a_{11}|$,

(f) from the inequality $|A| \le |B|$ (A and B are matrices) follows
the inequality
$$\|A\| \le \|B\|$$
In particular, $\|A\| = \| \, |A| \, \|$.

In the sequel, for any matrix $A = [a_{ij}]$ of arbitrary dimensions
we will consider mainly three norms that are easily computed:

(1) $\|A\|_m = \max_i \sum_j |a_{ij}|$ (the m-norm),

(2) $\|A\|_l = \max_j \sum_i |a_{ij}|$ (the l-norm),

(3) $\|A\|_k = \sqrt{\sum_{i,j} |a_{ij}|^2}$ (the k-norm).


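Example. Suppose
$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$$
We have
$$\|A\|_m = \max(1+2+3, \ 4+5+6, \ 7+8+9) = \max(6, 15, 24) = 24,$$
$$\|A\|_l = \max(1+4+7, \ 2+5+8, \ 3+6+9) = \max(12, 15, 18) = 18,$$
$$\|A\|_k = \sqrt{1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2 + 7^2 + 8^2 + 9^2} = \sqrt{1+4+9+16+25+36+49+64+81} = \sqrt{285} \approx 16.9$$

In particular, for the vector
$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$
these norms have the following values:
$$\|x\|_m = \max_i |x_i|,$$
$$\|x\|_l = |x_1| + |x_2| + \ldots + |x_n|,$$
$$\|x\|_k = |x| = \sqrt{|x_1|^2 + |x_2|^2 + \ldots + |x_n|^2}$$
(the absolute value of the vector). If the components of the vector
are real, then we simply have
$$\|x\|_k = \sqrt{x_1^2 + x_2^2 + \ldots + x_n^2}$$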
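
The three norms translate directly into code. The following is a
minimal sketch, assuming matrices are stored as lists of rows; the
function names norm_m, norm_l and norm_k are chosen here to mirror
the notation of the text.

```python
import math

def norm_m(a):
    """m-norm: the largest row sum of moduli."""
    return max(sum(abs(x) for x in row) for row in a)

def norm_l(a):
    """l-norm: the largest column sum of moduli."""
    return max(sum(abs(x) for x in col) for col in zip(*a))

def norm_k(a):
    """k-norm: square root of the sum of squared moduli of all elements."""
    return math.sqrt(sum(abs(x) ** 2 for row in a for x in row))

# The matrix of the example above.
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

# A column vector is an n x 1 matrix (illustrative values).
x = [[3], [-4]]
```

Here norm_m(A) and norm_l(A) give 24 and 18, and norm_k(A) gives the
square root of 285; for the vector x the values are 4, 7 and 5.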


Let us verify the Conditions (a) to (d) for the norms $\|A\|_m$,
$\|A\|_l$ and $\|A\|_k$.

It is immediately obvious that the Conditions (a) and (b) are
fulfilled. Let us assure ourselves that Condition (c) is fulfilled as
well. Let $A = [a_{ij}]$ and $B = [b_{ij}]$, A and B being of the same
size. We have
$$\|A + B\|_m = \max_i \sum_j |a_{ij} + b_{ij}| \le \max_i \Big\{ \sum_j |a_{ij}| + \sum_j |b_{ij}| \Big\} \le \max_i \sum_j |a_{ij}| + \max_i \sum_j |b_{ij}| = \|A\|_m + \|B\|_m$$
Similarly
$$\|A + B\|_l \le \|A\|_l + \|B\|_l$$


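Furthermore,
$$\|A + B\|_k = \sqrt{\sum_{i,j} |a_{ij} + b_{ij}|^2} \le \sqrt{\sum_{i,j} |a_{ij}|^2 + \sum_{i,j} |b_{ij}|^2 + 2\sum_{i,j} |a_{ij}| \, |b_{ij}|}$$
Applying the familiar Cauchy inequality 1)
$$\sum_{i,j} |a_{ij}| \, |b_{ij}| \le \sqrt{\sum_{i,j} |a_{ij}|^2} \, \sqrt{\sum_{i,j} |b_{ij}|^2}$$
we get
$$\|A + B\|_k \le \sqrt{\sum_{i,j} |a_{ij}|^2} + \sqrt{\sum_{i,j} |b_{ij}|^2} = \|A\|_k + \|B\|_k$$
The Condition (c) is thus fulfilled for all three norms.

1) The proof of the Cauchy inequality follows:
$$\Big| \sum_{s=1}^{n} a_sb_s \Big|^2 \le \sum_{s=1}^{n} |a_s|^2 \sum_{s=1}^{n} |b_s|^2$$
where $a_s$ and $b_s$ ($s = 1, 2, \ldots, n$) are arbitrary complex
numbers. Let $\lambda$ be a real variable. We consider the obvious
inequality
$$\sum_{s=1}^{n} |a_s\lambda + b_se^{i\varphi_s}|^2 \ge 0 \qquad (*)$$
where $\varphi_s$ are some real numbers. Denoting by $\bar{a}_s$ and
$\bar{b}_s$ the conjugates of $a_s$ and $b_s$, we have
$$|a_s\lambda + b_se^{i\varphi_s}|^2 = (a_s\lambda + b_se^{i\varphi_s})(\bar{a}_s\lambda + \bar{b}_se^{-i\varphi_s}) = a_s\bar{a}_s\lambda^2 + (a_s\bar{b}_se^{-i\varphi_s} + \bar{a}_sb_se^{i\varphi_s})\lambda + b_s\bar{b}_s = |a_s|^2\lambda^2 + 2\,\mathrm{Re}(a_s\bar{b}_se^{-i\varphi_s})\lambda + |b_s|^2$$
The inequality (*) then becomes
$$\lambda^2 \sum_{s=1}^{n} |a_s|^2 + 2\lambda \sum_{s=1}^{n} \mathrm{Re}(a_s\bar{b}_se^{-i\varphi_s}) + \sum_{s=1}^{n} |b_s|^2 \ge 0$$
If we put
$$\varphi_s = \arg(a_s\bar{b}_s)$$
then
$$\mathrm{Re}(a_s\bar{b}_se^{-i\varphi_s}) = \mathrm{Re}\{ |a_s\bar{b}_s| \, e^{i\arg(a_s\bar{b}_s)} e^{-i\arg(a_s\bar{b}_s)} \} = \mathrm{Re}\{ |a_s\bar{b}_s| \} = |a_s\bar{b}_s| = |a_sb_s|$$
and, consequently,
$$\lambda^2 \sum_{s=1}^{n} |a_s|^2 + 2\lambda \sum_{s=1}^{n} |a_sb_s| + \sum_{s=1}^{n} |b_s|^2 \ge 0$$
Since the left member of this inequality is nonnegative for
arbitrary real $\lambda$, the corresponding quadratic equation cannot
have distinct real roots. Therefore the discriminant of the equation
satisfies
$$\Big( \sum_{s=1}^{n} |a_sb_s| \Big)^2 - \sum_{s=1}^{n} |a_s|^2 \sum_{s=1}^{n} |b_s|^2 \le 0$$
that is
$$\Big( \sum_{s=1}^{n} |a_sb_s| \Big)^2 \le \sum_{s=1}^{n} |a_s|^2 \sum_{s=1}^{n} |b_s|^2$$
whence, all the more so,
$$\Big| \sum_{s=1}^{n} a_sb_s \Big|^2 \le \Big( \sum_{s=1}^{n} |a_sb_s| \Big)^2 \le \sum_{s=1}^{n} |a_s|^2 \cdot \sum_{s=1}^{n} |b_s|^2$$
If the numbers $a_s$ and $b_s$ are real, then we simply get
$$\Big( \sum_{s=1}^{n} a_sb_s \Big)^2 \le \sum_{s=1}^{n} a_s^2 \sum_{s=1}^{n} b_s^2$$

Let us now verify Condition (d). Suppose matrix $A = [a_{ij}]$ is of
dimensions m' x n' and matrix $B = [b_{ij}]$ is of dimensions
m'' x n''. For the possibility of multiplication of the first matrix
by the second, it is necessary that m'' = n', and the matrix AB will
have the dimensions m' x n''.

We have
$$\|AB\|_m = \max_i \sum_{j=1}^{n''} \Big| \sum_{s=1}^{n'} a_{is}b_{sj} \Big| \le \max_i \sum_{j=1}^{n''} \sum_{s=1}^{n'} |a_{is}| \, |b_{sj}| = \max_i \sum_{s=1}^{n'} \Big\{ |a_{is}| \sum_{j=1}^{n''} |b_{sj}| \Big\} \le \max_i \sum_{s=1}^{n'} |a_{is}| \cdot \|B\|_m = \|A\|_m \|B\|_m$$
Similarly
$$\|AB\|_l = \max_j \sum_{i=1}^{m'} \Big| \sum_{s=1}^{n'} a_{is}b_{sj} \Big| \le \max_j \sum_{i=1}^{m'} \sum_{s=1}^{n'} |a_{is}| \, |b_{sj}| = \max_j \sum_{s=1}^{n'} \Big\{ |b_{sj}| \sum_{i=1}^{m'} |a_{is}| \Big\} \le \|A\|_l \max_j \sum_{s=1}^{n'} |b_{sj}| = \|A\|_l \|B\|_l$$
Furthermore,
$$\|AB\|_k = \sqrt{\sum_{i=1}^{m'} \sum_{j=1}^{n''} \Big| \sum_{s=1}^{n'} a_{is}b_{sj} \Big|^2} \le \sqrt{\sum_{i=1}^{m'} \sum_{j=1}^{n''} \Big( \sum_{s=1}^{n'} |a_{is}| \, |b_{sj}| \Big)^2}$$
Applying the Cauchy inequality and taking into account that
m'' = n', we obtain
$$\|AB\|_k \le \sqrt{\sum_{i=1}^{m'} \sum_{j=1}^{n''} \Big\{ \sum_{s=1}^{n'} |a_{is}|^2 \cdot \sum_{s=1}^{n'} |b_{sj}|^2 \Big\}} = \sqrt{\sum_{i=1}^{m'} \sum_{s=1}^{n'} |a_{is}|^2 \cdot \sum_{s=1}^{n'} \sum_{j=1}^{n''} |b_{sj}|^2} = \sqrt{\|A\|_k^2 \, \|B\|_k^2} = \|A\|_k \|B\|_k$$
Hence, Condition (d) is fulfilled for the norms under consideration.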
We will now show that the norms $\|A\|_m$, $\|A\|_l$ and $\|A\|_k$
are canonical.
If $a_{pq}$ is the largest, in modulus, element of the matrix
$A = [a_{ij}]$ of dimensions m' x n', then we obviously have
$$\|A\|_m \ge |a_{p1}| + \ldots + |a_{pq}| + \ldots + |a_{pn'}| \ge |a_{pq}|,$$
$$\|A\|_l \ge |a_{1q}| + \ldots + |a_{pq}| + \ldots + |a_{m'q}| \ge |a_{pq}|$$
and
$$\|A\|_k = \sqrt{\sum_{i,j} |a_{ij}|^2} \ge |a_{pq}|$$
Thus
$$|a_{ij}| \le |a_{pq}| \le \|A\|_s \quad (s = m, l, k)$$
Besides, if $A = [a_{11}]$, then
$$\|A\|_m = \|A\|_l = \|A\|_k = |a_{11}|$$
so that Condition (e) is fulfilled.

Furthermore, if $|A| \le |B|$, where $A = [a_{ij}]$ and $B = [b_{ij}]$,
then $|a_{ij}| \le |b_{ij}|$. From the definition of the norms
$\|A\|_m$, $\|A\|_l$ and $\|A\|_k$, it is obvious that the inequalities
$$\|A\|_s \le \|B\|_s \quad (s = m, l, k)$$
hold true.
Besides, for any one of the norms we have
$$\|A\|_s = \| \, |A| \, \|_s \quad (s = m, l, k)$$
Thus, Condition (f) is also fulfilled.
We have thus proved that the norms $\|A\|_m$, $\|A\|_l$ and $\|A\|_k$
are canonical.

Note that if matrix E is a unit matrix of order n, then
$$\|E\|_m = \|E\|_l = 1$$
and
$$\|E\|_k = \sqrt{n}$$


7.8 THE RANK OF A MATRIX 
Suppose we have a rectangular matrix
$$A = \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ \ldots & \ldots & \ldots & \ldots \\ a_{m1} & a_{m2} & \ldots & a_{mn} \end{bmatrix}$$

If in this matrix we choose in arbitrary fashion k rows and
k columns, where $k \le \min(m, n)$, then the elements at the
intersections of these rows and columns form a square matrix of
order k. The determinant of this latter matrix is called a minor of
kth order of the matrix A.

Definition. The rank of a matrix is the order of the nonvanishing
minor of greatest order of the matrix. In other words, a matrix A
has rank r if:

(1) there is at least one minor of rth order different from zero;

(2) all minors of A of order r + 1 and higher are equal to zero.

The rank of a zero matrix, that is, one consisting of zeros, is 
taken to be zero. The difference between the smallest of the num- 
bers m and n and the rank of the matrix is termed the nullity 
of the matrix. If the nullity is zero, then the rank of the matrix 
is the largest of possible ranks for the given dimensions. 

When determining the rank of a matrix it is useful to abide 
by the following rules: 

(1) go from minors of small orders (beginning with minors of 
order one, that is, with the elements of the matrix) to minors of 
larger orders; 

(2) suppose we have found a nonzero minor D of order r, then 
we have only to compute the minors of order (r-+1) bordering 
the minor D. If all these minors are zero, then the rank of the 
matrix is r; but if even one of them is nonzero, then this opera- 
tion has to be applied to it, and then the rank of the matrix is 
definitely greater than r.

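Example. Find the rank of the matrix
$$A = \begin{bmatrix} 2 & -4 & 3 & 1 & 0 \\ 1 & -2 & 1 & -4 & 2 \\ 0 & 1 & -1 & 3 & 1 \\ 4 & -7 & 4 & -4 & 5 \end{bmatrix}$$

Solution. The second-order minor in the upper left-hand corner
of the matrix is zero. However, the matrix contains nonzero minors
of the second order, as for example
$$D = \begin{vmatrix} -4 & 3 \\ -2 & 1 \end{vmatrix} = 2 \neq 0$$
and the third-order minor bordering it:
$$D' = \begin{vmatrix} 2 & -4 & 3 \\ 1 & -2 & 1 \\ 0 & 1 & -1 \end{vmatrix} = 1 \neq 0$$
and both fourth-order minors bordering D' are equal to zero:
$$\begin{vmatrix} 2 & -4 & 3 & 1 \\ 1 & -2 & 1 & -4 \\ 0 & 1 & -1 & 3 \\ 4 & -7 & 4 & -4 \end{vmatrix} = 0, \qquad \begin{vmatrix} 2 & -4 & 3 & 0 \\ 1 & -2 & 1 & 2 \\ 0 & 1 & -1 & 1 \\ 4 & -7 & 4 & 5 \end{vmatrix} = 0$$

Hence the rank of the matrix is three, and the nullity is 4 - 3 = 1.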
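
The result of this example can be cross-checked by machine. The sketch
below does not use bordering minors; it swaps in a different
technique, reduction by elementary row transformations, which leave
the rank unchanged, and counts the nonzero rows that remain. The
function name rank and the use of exact fractions are assumptions
made for this illustration.

```python
from fractions import Fraction

def rank(matrix):
    """Rank of a rectangular matrix by forward elimination on rows."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_rows = len(rows)
    n_cols = len(rows[0]) if rows else 0
    r = 0                                   # pivot rows found so far
    for c in range(n_cols):
        # look for a nonzero element in column c at or below row r
        pivot = next((i for i in range(r, n_rows) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, n_rows):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        r += 1
        if r == n_rows:
            break
    return r

# The matrix of the example above.
A = [[2, -4, 3, 1, 0],
     [1, -2, 1, -4, 2],
     [0, 1, -1, 3, 1],
     [4, -7, 4, -4, 5]]
```

For this matrix the elimination leaves three nonzero rows, in
agreement with the rank found from the minors; indeed the fourth row
equals the first row plus twice the second plus the third.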


7.9 THE LIMIT OF A MATRIX 


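Suppose we have a sequence of matrices
$$A_k = [a_{ij}^{(k)}] \quad (k = 1, 2, \ldots) \qquad (1)$$
of the same dimensions m x n ($i = 1, 2, \ldots, m$;
$j = 1, 2, \ldots, n$). By the limit of the sequence of matrices
$A_k$ is meant the matrix
$$A = \lim_{k \to \infty} A_k = \Big[ \lim_{k \to \infty} a_{ij}^{(k)} \Big] \qquad (2)$$

A sequence of matrices having a limit is called a convergent
sequence.

Lemma 1. For a sequence of matrices $A_k$ ($k = 1, 2, \ldots$) to
converge to the matrix A it is necessary and sufficient that
$$\|A - A_k\| \to 0 \ \text{as} \ k \to \infty \qquad (3)$$
where $\|A\|$ is any canonical norm of A. Here
$$\lim_{k \to \infty} \|A_k\| = \|A\|$$
Indeed, if
$$A_k \to A = [a_{ij}]$$
then
$$|a_{ij} - a_{ij}^{(k)}| < \varepsilon \ \text{for} \ k > N(\varepsilon)$$
whence
$$|A - A_k| \le \varepsilon I$$
where I is an m x n matrix all elements of which are unity. By
the properties of the norm we have
$$\|A - A_k\| \le \varepsilon \|I\| \ \text{for} \ k > N(\varepsilon)$$
hence,
$$\lim_{k \to \infty} \|A - A_k\| = 0 \qquad (4)$$

Conversely, let Condition (3) be valid. Then for $k > N(\varepsilon)$
we have
$$|a_{ij} - a_{ij}^{(k)}| \le \|A - A_k\| < \varepsilon$$
and hence
$$\lim_{k \to \infty} a_{ij}^{(k)} = a_{ij}$$
that is,
$$\lim_{k \to \infty} A_k = A$$

Besides, if $A_k \to A$, then we have
$$\big| \, \|A\| - \|A_k\| \, \big| \le \|A - A_k\| \to 0 \ \text{as} \ k \to \infty$$
Therefore
$$\lim_{k \to \infty} \|A_k\| = \|A\|$$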


Corollary. The sequence $A_k \to 0$ as $k \to \infty$ if and only if
$$\lim_{k \to \infty} \|A_k\| = 0$$
where $\|A_k\|$ is some canonical norm.
It is easy to see that if
$$\lim_{k \to \infty} A_k = A \quad \text{and} \quad \lim_{k \to \infty} B_k = B$$
then
(a) $\lim_{k \to \infty} (A_k + B_k) = A + B$,
(b) $\lim_{k \to \infty} (A_kB_k) = AB$,
(c) $\lim_{k \to \infty} A_k^{-1} = A^{-1}$ ($\det A \neq 0$)
on the assumption that the corresponding operations are meaningful.

In particular, if C is a constant matrix such that the
multiplications $CA_k$ and $A_kC$ ($k = 1, 2, \ldots$) are possible,
then
$$\lim_{k \to \infty} CA_k = CA$$
and
$$\lim_{k \to \infty} A_kC = AC$$


Lemma 2. For the convergence of a sequence of matrices
$A_k$ ($k = 1, 2, \ldots$) it is necessary and sufficient that the
generalized Cauchy test hold, namely: for any $\varepsilon > 0$ there
must be a number $N = N(\varepsilon)$ such that for $k > N$ and $p > 0$
$$\|A_{k+p} - A_k\| < \varepsilon \qquad (5)$$
where $\| \cdot \|$ is any canonical norm.

Indeed, if inequality (5) is valid, then for every element
$a_{ij}^{(k)}$ of matrix $A_k$ the Cauchy test (see Sec. 3.4) will
hold, and hence there exists
$$\lim_{k \to \infty} A_k = \Big[ \lim_{k \to \infty} a_{ij}^{(k)} \Big]$$
Conversely, if there exists
$$A = \lim_{k \to \infty} A_k$$
then by Lemma 1
$$\|A - A_k\| \to 0 \ \text{as} \ k \to \infty$$
and, thus, the inequality (5) holds true.


7.10 SERIES OF MATRICES 


Using the concept of a limit of a matrix, we can introduce
series of matrices (matrix series):
$$\sum_{k=1}^{\infty} A_k = \lim_{N \to \infty} \sum_{k=1}^{N} A_k \qquad (1)$$
where $A_k$ are matrices of the same dimensions.

If the limit (1) exists, then the matrix series is convergent, and
the matrix obtained in the limit is termed the sum of the series.
If the limit (1) does not exist, the matrix series is divergent and
no sum is assigned to it.

The following theorem gives a necessary condition for the
convergence of a matrix series.

Theorem 1. If the matrix series (1) converges, then
$$\lim_{k \to \infty} A_k = 0$$
Proof. Let
$$S_k = \sum_{i=1}^{k} A_i$$
If the series (1) converges, then there exists the finite limit
$$S = \lim_{k \to \infty} S_k$$
We have
$$A_k = S_k - S_{k-1}$$
whence
$$\lim_{k \to \infty} A_k = \lim_{k \to \infty} S_k - \lim_{k \to \infty} S_{k-1} = S - S = 0$$
k> œw 
The matrix series (1) is called absolutely convergent if the
following series converges:
$$\sum_{k=1}^{\infty} |A_k| \qquad (2)$$
Theorem 2. An absolutely convergent matrix series is a convergent
series.
Proof. Let
$$A_k = [a_{ij}^{(k)}] \quad (k = 1, 2, \ldots)$$
Then
$$\sum_{k=1}^{\infty} |A_k| = \Big[ \sum_{k=1}^{\infty} |a_{ij}^{(k)}| \Big]$$
Since the matrix series (2) converges, then, by definition, each
of the numerical series
$$\sum_{k=1}^{\infty} |a_{ij}^{(k)}| \quad (i = 1, 2, \ldots, m; \ j = 1, 2, \ldots, n)$$
is convergent. From this it follows by a familiar theorem of the
theory of series that all the series $\sum_{k=1}^{\infty} a_{ij}^{(k)}$
($i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$) converge, and converge
absolutely; that is, there is a limit
$$S = \lim_{N \to \infty} S_N = \lim_{N \to \infty} \sum_{k=1}^{N} A_k$$
and hence, the matrix series (1) converges.
For a rough analysis of the convergence of the matrix series (1)
we can take advantage of the sufficient condition given below.

Theorem 3. If $\|A\|$ is any canonical norm and the numerical
series
$$\sum_{k=1}^{\infty} \|A_k\| \qquad (3)$$
converges, then the matrix series (1) also converges, and converges
absolutely.

Proof. Let
$$A_k = [a_{ij}^{(k)}] \quad (k = 1, 2, \ldots)$$
We consider the numerical series
$$\sum_{k=1}^{\infty} a_{ij}^{(k)} \qquad (4)$$
($i = 1, 2, \ldots, m$; $j = 1, 2, \ldots, n$). Since
$$|a_{ij}^{(k)}| \le \|A_k\|$$
it follows that each of the series (4) converges, and converges
absolutely. Hence, the matrix series
$$\sum_{k=1}^{\infty} A_k = \Big[ \sum_{k=1}^{\infty} a_{ij}^{(k)} \Big]$$
by definition converges, and converges absolutely.
Very important in applications are matrix power series:
right-hand,
$$\sum_{k=0}^{\infty} A_kX^k \qquad (5)$$
and left-hand,
$$\sum_{k=0}^{\infty} X^kA_k \qquad (5')$$
where X is a square matrix of order n. In the first case, $A_k$ are
m x n matrices or scalars (for instance, $A_k$ may be row vectors);
in the second case, $A_k$ are n x m matrices or scalars (for instance,
$A_k$ may be column vectors).

Theorem 4. If r is the radius of convergence of the scalar power
series
$$\sum_{k=0}^{\infty} \|A_k\| x^k \qquad (6)$$
where $\|A_k\|$ ($k = 0, 1, 2, \ldots$) is some canonical norm, then
the matrix power series (5) and (5') definitely converge for
$$\|X\| < r \qquad (7)$$
In particular, the matrix power series
$$\sum_{k=0}^{\infty} a_kX^k$$
with numerical coefficients $a_k$ ($k = 0, 1, 2, \ldots$) converges for
$$\|X\| < r$$
where r is the radius of convergence of the power series
$$\sum_{k=0}^{\infty} |a_k| x^k$$
Proof. Since
$$\|A_kX^k\| \le \|A_k\| \, \|X\|^k$$
then, when inequality (7) holds, the series
$$\sum_{k=0}^{\infty} \|A_kX^k\|$$
converges. Consequently, by Theorem 3, the power series (5) also
converges.

Similar reasoning holds true for the series (5').

The second assertion of the theorem follows from the fact that
if $a_k$ is a scalar, then
$$\|a_k\| = |a_k|$$

Theorem 5. The geometric series
$$A + AX + AX^2 + \ldots + AX^k + \ldots \qquad (8)$$
and
$$A + XA + X^2A + \ldots + X^kA + \ldots \qquad (8')$$
where X is a square matrix, converge if
$$\|X\| < 1 \qquad (9)$$
Here
$$\sum_{k=0}^{\infty} AX^k = A(E - X)^{-1}$$
and
$$\sum_{k=0}^{\infty} X^kA = (E - X)^{-1}A$$

Indeed, by Theorem 4, given the condition (9), the geometric
series (8) converges, that is, there exists a finite matrix
$$S = \sum_{k=0}^{\infty} AX^k$$
Consider the identity
$$A(E + X + X^2 + \ldots + X^k)(E - X) = A(E - X^{k+1}) \qquad (10)$$
Passing to the limit as $k \to \infty$ in (10) and noting that by
virtue of Condition (9),
$$X^{k+1} \to 0 \ \text{as} \ k \to \infty$$
we will have
$$S(E - X) = AE = A \qquad (11)$$
In particular, assuming A = E in (11), we get
$$S_1(E - X) = E$$
where
$$S_1 = \sum_{k=0}^{\infty} X^k$$
From this
$$\det S_1 \cdot \det(E - X) = \det E = 1$$
Since $\det S_1$ is finite, it follows that
$$\det(E - X) \neq 0$$
and, consequently, the matrix E - X is nonsingular, which means
it has an inverse, $(E - X)^{-1}$.

Multiplying both members of (11) on the right by $(E - X)^{-1}$,
we finally get
$$S = \sum_{k=0}^{\infty} AX^k = A(E - X)^{-1}$$
In a similar manner we prove that
$$\sum_{k=0}^{\infty} X^kA = (E - X)^{-1}A$$
for
$$\|X\| < 1$$

Corollary. If $\|X\| < 1$, then there is an inverse matrix
$$(E - X)^{-1} = \sum_{k=0}^{\infty} X^k$$
What is more, if $\|E\| = 1$, then
$$\|(E - X)^{-1}\| \le \sum_{k=0}^{\infty} \|X\|^k = \frac{1}{1 - \|X\|}$$

Note. If $\|X\| < 1$, then it is easy to estimate the norm of the
remainder of the matrix series (8).
We have
$$\|R_k\| = \|A(E - X)^{-1} - A(E + X + X^2 + \ldots + X^k)\| \le \|A\| \, \|X\|^{k+1} + \|A\| \, \|X\|^{k+2} + \ldots = \frac{\|A\| \, \|X\|^{k+1}}{1 - \|X\|}$$
Similarly for the series (8') we have
$$\|(E - X)^{-1}A - (E + X + X^2 + \ldots + X^k)A\| \le \frac{\|A\| \, \|X\|^{k+1}}{1 - \|X\|}$$
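
The remainder estimate can be tried out on a small case. The sketch
below is an illustration under stated assumptions: A is taken to be
the unit matrix E, the 2x2 matrix X is chosen here so that its m-norm
is 3/4, and the helper names are hypothetical. It compares the
m-norm of the true remainder after k + 1 terms with the bound from
the Note.

```python
from fractions import Fraction as F

def matmul(a, b):
    """Ordinary product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def identity(n):
    """Unit matrix E of order n."""
    return [[F(int(i == j)) for j in range(n)] for i in range(n)]

def norm_m(a):
    """m-norm: the largest row sum of moduli."""
    return max(sum(abs(x) for x in row) for row in a)

def geometric_partial_sum(x, k):
    """Return E + X + X^2 + ... + X^k."""
    n = len(x)
    total = identity(n)
    term = identity(n)
    for _ in range(k):
        term = matmul(term, x)      # next power of X
        total = [[t + v for t, v in zip(tr, vr)]
                 for tr, vr in zip(total, term)]
    return total

# Illustrative matrix (assumed): its m-norm is 3/4 < 1.
X = [[F(1, 2), F(1, 4)],
     [F(0), F(1, 3)]]
E_minus_X = [[F(1, 2), F(-1, 4)],
             [F(0), F(2, 3)]]
exact = [[F(2), F(3, 4)],           # (E - X)^{-1}, worked out by hand
         [F(0), F(3, 2)]]

k = 10
S_k = geometric_partial_sum(X, k)
remainder = norm_m([[e - s for e, s in zip(er, sr)]
                    for er, sr in zip(exact, S_k)])
bound = norm_m(X) ** (k + 1) / (1 - norm_m(X))
```

In this example the true remainder turns out to be far smaller than
the bound, which is to be expected since the estimate uses only the
norm of X.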

Matrix series make it possible to define the transcendental
functions of a matrix. For example, it is assumed that
$$e^X = \sum_{n=0}^{\infty} \frac{X^n}{n!} \qquad (12)$$
and it is possible to prove that the series (12) converges for an
arbitrary square matrix X.
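
As a small illustration of the series (12), the sketch below sums its
first few terms in exact rational arithmetic. The function name
mat_exp_partial and the test matrix are assumptions introduced here;
a truncated series is only an approximation in general, although for
a nilpotent matrix the series terminates and the partial sum is exact.

```python
from fractions import Fraction

def matmul(a, b):
    """Ordinary product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def identity(n):
    """Unit matrix E of order n."""
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]

def mat_exp_partial(x, terms):
    """Sum of the first `terms` terms of the series for e^X."""
    n = len(x)
    total = [[Fraction(0)] * n for _ in range(n)]
    term = identity(n)                     # X^0 / 0!
    for k in range(terms):
        if k > 0:
            # X^k / k! is obtained from X^(k-1) / (k-1)! times X / k
            term = [[v / k for v in row] for row in matmul(term, x)]
        total = [[t + v for t, v in zip(tr, vr)]
                 for tr, vr in zip(total, term)]
    return total

# Nilpotent example (assumed): N^2 = 0, so e^N = E + N exactly.
N = [[0, 1],
     [0, 0]]
exp_N = mat_exp_partial(N, 5)
```

For a 1x1 matrix the same function simply sums the ordinary series
for the exponential function.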


7.11 PARTITIONED MATRICES


Suppose we have a matrix A. We partition it into matrices
of lower orders (submatrices, or blocks) using horizontal and
vertical partitions that run through the whole matrix. For example,
$$A = \left[ \begin{array}{cc|c} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ \hline a_{31} & a_{32} & a_{33} \end{array} \right]$$
where the blocks are the submatrices
$$P = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix}, \quad Q = \begin{bmatrix} a_{13} \\ a_{23} \end{bmatrix}, \quad R = [a_{31} \ a_{32}], \quad S = [a_{33}]$$
Then A may be regarded as a supermatrix whose elements are
blocks (submatrices):
$$A = \left[ \begin{array}{c|c} P & Q \\ \hline R & S \end{array} \right]$$

A matrix partitioned into submatrices, or blocks, is called a
partitioned matrix. Quite naturally, the partitioning of a matrix may
be done in a variety of ways. A special case of partitioned
matrices are the quasidiagonal matrices
$$A = \begin{bmatrix} A_1 & & & \\ & A_2 & & \\ & & \ddots & \\ & & & A_s \end{bmatrix}$$
where the blocks $A_i$ ($i = 1, \ldots, s$) are square matrices of
(generally speaking) different orders, all other elements being zeros.
Note that
$$\det A = \det A_1 \ldots \det A_s$$

Another important special case of partitioned matrices are
bordered matrices
$$A_n = \left[ \begin{array}{c|c} A_{n-1} & U_n \\ \hline V_n & a_{nn} \end{array} \right]$$
where
$$A_{n-1} = \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1,n-1} \\ a_{21} & a_{22} & \ldots & a_{2,n-1} \\ \ldots & \ldots & \ldots & \ldots \\ a_{n-1,1} & a_{n-1,2} & \ldots & a_{n-1,n-1} \end{bmatrix}$$
is a matrix of order n - 1;
$$U_n = \begin{bmatrix} a_{1n} \\ a_{2n} \\ \vdots \\ a_{n-1,n} \end{bmatrix}$$
is a column matrix; $V_n = [a_{n1} \ a_{n2} \ \ldots \ a_{n,n-1}]$ is a
row matrix and $a_{nn}$ is a scalar.

Let us agree to use the term conformal to designate partitioned
matrices of the same dimensions and with the same partitioning.
Partitioned matrices are convenient in that operations involving
them are carried out formally by the same rules used for ordinary
matrices.


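A. THE ADDITION AND SUBTRACTION
OF PARTITIONED MATRICES


If the partitioned matrices
$$A = \begin{bmatrix} A_{11} & A_{12} & \ldots & A_{1q} \\ \ldots & \ldots & \ldots & \ldots \\ A_{p1} & A_{p2} & \ldots & A_{pq} \end{bmatrix} \qquad (1)$$
and
$$B = \begin{bmatrix} B_{11} & B_{12} & \ldots & B_{1s} \\ \ldots & \ldots & \ldots & \ldots \\ B_{r1} & B_{r2} & \ldots & B_{rs} \end{bmatrix} \qquad (2)$$
are conformal, that is, p = r, q = s and the blocks $A_{ij}$ and
$B_{ij}$ have the same dimensions, then
$$A + B = \begin{bmatrix} A_{11} + B_{11} & A_{12} + B_{12} & \ldots & A_{1q} + B_{1q} \\ \ldots & \ldots & \ldots & \ldots \\ A_{p1} + B_{p1} & A_{p2} + B_{p2} & \ldots & A_{pq} + B_{pq} \end{bmatrix}$$

Indeed, in order to add the matrices A and B it is necessary
to add the corresponding elements, but it is obvious that the same
thing is achieved if we add the corresponding blocks (submatrices)
of these matrices.

Subtraction of partitioned matrices is performed analogously.

If A is the partitioned matrix (1) and $\alpha$ is a scalar, then we
have
$$\alpha A = \begin{bmatrix} \alpha A_{11} & \alpha A_{12} & \ldots & \alpha A_{1q} \\ \ldots & \ldots & \ldots & \ldots \\ \alpha A_{p1} & \alpha A_{p2} & \ldots & \alpha A_{pq} \end{bmatrix}$$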


B. THE MULTIPLICATION OF PARTITIONED MATRICES 


Suppose the partitioned matrices A and B have the structure
given in (1) and (2), respectively, and q = r.

Assume that all the blocks $A_{ij}$ and $B_{jk}$ ($i = 1, 2, \ldots, p$;
$j = 1, 2, \ldots, q$; $k = 1, 2, \ldots, s$) are such that the number of
columns of block $A_{ij}$ is equal to the number of rows of block
$B_{jk}$. In the special case when all blocks $A_{ij}$ and $B_{jk}$ are
square and have the same order, this assumption is definitely
fulfilled. Then we can prove that the product of the matrices A and B
is the partitioned matrix
$$C = \begin{bmatrix} C_{11} & C_{12} & \ldots & C_{1s} \\ \ldots & \ldots & \ldots & \ldots \\ C_{p1} & C_{p2} & \ldots & C_{ps} \end{bmatrix}$$
where
$$C_{ik} = \sum_{j=1}^{q} A_{ij}B_{jk} \quad (i = 1, 2, \ldots, p; \ k = 1, \ldots, s)$$
that is, the matrices A and B are multiplied as if the blocks
(submatrices) were numbers [2].
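
The rule for multiplying by blocks can be tested on a small case. The
sketch below is an illustration only: the function names
block_matmul and assemble, the storage of a partitioned matrix as a
list of block rows, and the numerical blocks are all assumptions made
here. It compares the block product with the ordinary product of the
assembled matrices.

```python
def matmul(a, b):
    """Ordinary product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def block_matmul(a_blocks, b_blocks):
    """Blocks C_ik = sum over j of A_ij B_jk, treating blocks like numbers."""
    q = len(b_blocks)                  # block rows of B = block columns of A
    s = len(b_blocks[0])
    result = []
    for a_row in a_blocks:
        c_row = []
        for k in range(s):
            acc = matmul(a_row[0], b_blocks[0][k])
            for j in range(1, q):
                acc = mat_add(acc, matmul(a_row[j], b_blocks[j][k]))
            c_row.append(acc)
        result.append(c_row)
    return result

def assemble(blocks):
    """Glue a partitioned matrix back into an ordinary list of rows."""
    rows = []
    for block_row in blocks:
        for r in range(len(block_row[0])):
            rows.append([x for blk in block_row for x in blk[r]])
    return rows

# Two conformally partitioned 3x3 matrices (illustrative numbers):
# a 2x2 block, a 2x1 block, a 1x2 block and a 1x1 block in each.
A_blocks = [[[[1, 2], [3, 4]], [[5], [6]]],
            [[[7, 8]],         [[9]]]]
B_blocks = [[[[1, 0], [2, 1]], [[3], [1]]],
            [[[0, 1]],         [[2]]]]

blockwise = assemble(block_matmul(A_blocks, B_blocks))
direct = matmul(assemble(A_blocks), assemble(B_blocks))
```

The two results coincide, as the rule asserts; the order of the
factors inside each block product must of course be preserved.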


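Example. Multiply the partitioned matrices
$$A = \left[ \begin{array}{c|c} P & Q \end{array} \right]$$
and
$$B = \left[ \begin{array}{c|c} R & S \\ \hline T & U \end{array} \right]$$
to get a matrix of the form
$$AB = \left[ \begin{array}{c|c} PR + QT & PS + QU \end{array} \right]$$

Addition and multiplication of quasidiagonal matrices are
especially simple. If
$$A = \begin{bmatrix} A_1 & & & \\ & A_2 & & \\ & & \ddots & \\ & & & A_s \end{bmatrix}, \qquad B = \begin{bmatrix} B_1 & & & \\ & B_2 & & \\ & & \ddots & \\ & & & B_s \end{bmatrix}$$
and the orders of the matrices $A_i$, $B_i$ ($i = 1, 2, \ldots, s$) are
the same, then we clearly have
$$A + B = \begin{bmatrix} A_1 + B_1 & & & \\ & A_2 + B_2 & & \\ & & \ddots & \\ & & & A_s + B_s \end{bmatrix}$$
and
$$AB = \begin{bmatrix} A_1B_1 & & & \\ & A_2B_2 & & \\ & & \ddots & \\ & & & A_sB_s \end{bmatrix}$$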





7.12 MATRIX INVERSION BY PARTITIONING 


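Suppose for a given nonsingular numerical matrix A it is required
to find the inverse $A^{-1}$. Partition A into four submatrices:
$$A = \begin{bmatrix} \alpha_{11} \ (r, r) & \alpha_{12} \ (r, s) \\ \alpha_{21} \ (s, r) & \alpha_{22} \ (s, s) \end{bmatrix}$$
The orders of the submatrices are indicated in parentheses; and
r + s = n, where n is the order of matrix A. We seek the inverse
$A^{-1}$ also in the form of a four-block matrix,
$$A^{-1} = \begin{bmatrix} \beta_{11} \ (r, r) & \beta_{12} \ (r, s) \\ \beta_{21} \ (s, r) & \beta_{22} \ (s, s) \end{bmatrix}$$
Then, since $A^{-1}A = E$, by multiplying the matrices we get
four matrix equations:
$$\beta_{11}\alpha_{11} + \beta_{12}\alpha_{21} = E_r,$$
$$\beta_{11}\alpha_{12} + \beta_{12}\alpha_{22} = 0, \qquad (1)$$
$$\beta_{21}\alpha_{11} + \beta_{22}\alpha_{21} = 0,$$
$$\beta_{21}\alpha_{12} + \beta_{22}\alpha_{22} = E_s$$
where $E_r$ and $E_s$ are unit matrices of appropriate orders. Solving
this system, we determine the blocks of matrix $A^{-1}$. In solving
(1), we use the method of eliminating unknowns. Postmultiplying the
first equation of (1) by $\alpha_{11}^{-1}\alpha_{12}$ and subtracting
the second equation of the system from the result obtained, we have
$$\beta_{12}(\alpha_{21}\alpha_{11}^{-1}\alpha_{12} - \alpha_{22}) = \alpha_{11}^{-1}\alpha_{12}$$
whence we find
$$\beta_{12} = -\alpha_{11}^{-1}\alpha_{12}(\alpha_{22} - \alpha_{21}\alpha_{11}^{-1}\alpha_{12})^{-1}$$
and
$$\beta_{11} = \alpha_{11}^{-1} - \beta_{12}\alpha_{21}\alpha_{11}^{-1}$$
Similarly, from the third and fourth equations of (1) we get
$$\beta_{22} = (\alpha_{22} - \alpha_{21}\alpha_{11}^{-1}\alpha_{12})^{-1}$$
and
$$\beta_{21} = -\beta_{22}\alpha_{21}\alpha_{11}^{-1}$$
It is of course assumed here that the corresponding operations
are meaningful. We introduce the matrices
$$X = \alpha_{11}^{-1}\alpha_{12}, \quad Y = \alpha_{21}\alpha_{11}^{-1}, \quad \theta = \alpha_{22} - \alpha_{21}X = \alpha_{22} - Y\alpha_{12} \qquad (2)$$
Then the formulas for the blocks $\beta_{ij}$ ($i, j = 1, 2$) may be
written more simply:
$$\beta_{11} = \alpha_{11}^{-1} + X\theta^{-1}Y, \quad \beta_{12} = -X\theta^{-1},$$
$$\beta_{21} = -\theta^{-1}Y, \quad \beta_{22} = \theta^{-1}$$
These formulas determine the blocks of matrix $A^{-1}$ provided that
$\alpha_{11}^{-1}$ and $\theta^{-1}$ exist. It is convenient to arrange
the computations in the following scheme [4]: from
$$A = \left[ \begin{array}{c|c} \alpha_{11} & \alpha_{12} \\ \hline \alpha_{21} & \alpha_{22} \end{array} \right]$$
compute in succession
$$\alpha_{11}^{-1}, \quad X = \alpha_{11}^{-1}\alpha_{12}, \quad Y = \alpha_{21}\alpha_{11}^{-1}, \quad \theta = \alpha_{22} - Y\alpha_{12}, \quad \theta^{-1}$$
and
$$A^{-1} = \left[ \begin{array}{c|c} \alpha_{11}^{-1} + X\theta^{-1}Y & -X\theta^{-1} \\ \hline -\theta^{-1}Y & \theta^{-1} \end{array} \right]$$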

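This method is useful if matrix $\alpha_{11}$ is readily invertible.
Example 1. Invert the matrix
$$A = \begin{bmatrix} 1 & 0 & 3 & -4 \\ 0 & 1 & 5 & 6 \\ -3 & 4 & 0 & 2 \\ -5 & -6 & 2 & 0 \end{bmatrix}$$

Solution. Set
$$\alpha_{11} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \alpha_{12} = \begin{bmatrix} 3 & -4 \\ 5 & 6 \end{bmatrix}, \quad \alpha_{21} = \begin{bmatrix} -3 & 4 \\ -5 & -6 \end{bmatrix}, \quad \alpha_{22} = \begin{bmatrix} 0 & 2 \\ 2 & 0 \end{bmatrix}$$
Since $\alpha_{11} = E$, we have $X = \alpha_{12}$, $Y = \alpha_{21}$ and
$$\theta = \alpha_{22} - Y\alpha_{12} = \begin{bmatrix} -11 & -34 \\ 47 & 16 \end{bmatrix}, \quad \det\theta = 1422, \quad \theta^{-1} = \frac{1}{1422} \begin{bmatrix} 16 & 34 \\ -47 & -11 \end{bmatrix}$$
Then
$$X\theta^{-1} = \frac{1}{1422} \begin{bmatrix} 3 & -4 \\ 5 & 6 \end{bmatrix} \begin{bmatrix} 16 & 34 \\ -47 & -11 \end{bmatrix} = \frac{1}{1422} \begin{bmatrix} 236 & 146 \\ -202 & 104 \end{bmatrix},$$
$$\theta^{-1}Y = \frac{1}{1422} \begin{bmatrix} 16 & 34 \\ -47 & -11 \end{bmatrix} \begin{bmatrix} -3 & 4 \\ -5 & -6 \end{bmatrix} = \frac{1}{1422} \begin{bmatrix} -218 & -140 \\ 196 & -122 \end{bmatrix},$$
$$X\theta^{-1}Y = \frac{1}{1422} \begin{bmatrix} 236 & 146 \\ -202 & 104 \end{bmatrix} \begin{bmatrix} -3 & 4 \\ -5 & -6 \end{bmatrix} = \frac{1}{1422} \begin{bmatrix} -1438 & 68 \\ 86 & -1432 \end{bmatrix}$$
As a check, we compute the product $X\theta^{-1}Y$ in two ways:
$$X\theta^{-1}Y = (X\theta^{-1})Y \quad \text{and} \quad X\theta^{-1}Y = X(\theta^{-1}Y)$$
By the general scheme we have
$$A^{-1} = \frac{1}{1422} \left[ \begin{array}{cc|cc} -16 & 68 & -236 & -146 \\ 86 & -10 & 202 & -104 \\ \hline 218 & 140 & 16 & 34 \\ -196 & 122 & -47 & -11 \end{array} \right]$$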

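A particular case of the foregoing method is the so-called method
of bordering. Essentially it is this. Suppose we have a matrix
$$A = [a_{ij}]$$
We form the following sequence of matrices:
$$S_1 = [a_{11}],$$
$$S_2 = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix},$$
$$S_3 = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} = \left[ \begin{array}{c|c} S_2 & \begin{matrix} a_{13} \\ a_{23} \end{matrix} \\ \hline a_{31} \ a_{32} & a_{33} \end{array} \right],$$
$$S_4 = \begin{bmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{bmatrix} = \left[ \begin{array}{c|c} S_3 & \begin{matrix} a_{14} \\ a_{24} \\ a_{34} \end{matrix} \\ \hline a_{41} \ a_{42} \ a_{43} & a_{44} \end{array} \right]$$
and so on. Each matrix is obtained from the preceding one by
bordering. The inverse of the second of these matrices, $S_2^{-1}$, is
found directly:
$$S_2^{-1} = \begin{bmatrix} \dfrac{a_{22}}{\Delta} & -\dfrac{a_{12}}{\Delta} \\ -\dfrac{a_{21}}{\Delta} & \dfrac{a_{11}}{\Delta} \end{bmatrix}$$
where
$$\Delta = a_{11}a_{22} - a_{12}a_{21}$$
Then, applying $S_2^{-1}$ to $S_3$ by the foregoing scheme of
computation, we get $S_3^{-1}$; then we use $S_3^{-1}$ to get $S_4^{-1}$,
and so on until, finally, $S_n^{-1} = A^{-1}$.

The method of bordering cannot be applied if one of the
intermediate matrices $S_i$ is singular. The situation can, however, be
rectified by an interchange of the rows of the matrix [5].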


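Example 2. Find the inverse of the matrix
$$A = \begin{bmatrix} 1 & 4 & 1 & 3 \\ 0 & -1 & 3 & -1 \\ 3 & 1 & 0 & 2 \\ 1 & -2 & 5 & 1 \end{bmatrix}$$

Solution. Here
$$S_2 = \begin{bmatrix} 1 & 4 \\ 0 & -1 \end{bmatrix}, \qquad S_2^{-1} = \begin{bmatrix} 1 & 4 \\ 0 & -1 \end{bmatrix}$$
The scheme for computing $S_3^{-1}$ is as follows. With
$\alpha_{11} = S_2$, $\alpha_{12} = \begin{bmatrix} 1 \\ 3 \end{bmatrix}$,
$\alpha_{21} = [3 \ \ 1]$ and $\alpha_{22} = 0$ we find
$$X = \begin{bmatrix} 13 \\ -3 \end{bmatrix}, \quad Y = [3 \ \ 11], \quad \theta = -36, \quad \theta^{-1} = -\frac{1}{36},$$
$$X\theta^{-1}Y = \begin{bmatrix} -\dfrac{13}{12} & -\dfrac{143}{36} \\ \dfrac{1}{4} & \dfrac{11}{12} \end{bmatrix}$$
Hence
$$S_3^{-1} = \begin{bmatrix} -\dfrac{1}{12} & \dfrac{1}{36} & \dfrac{13}{36} \\ \dfrac{1}{4} & -\dfrac{1}{12} & -\dfrac{1}{12} \\ \dfrac{1}{12} & \dfrac{11}{36} & -\dfrac{1}{36} \end{bmatrix}$$

The following scheme can be used to compute $S_4^{-1}$. With
$\alpha_{11} = S_3$, $\alpha_{12} = \begin{bmatrix} 3 \\ -1 \\ 2 \end{bmatrix}$,
$\alpha_{21} = [1 \ \ {-2} \ \ 5]$ and $\alpha_{22} = 1$ we find
$$X = \begin{bmatrix} \dfrac{4}{9} \\ \dfrac{2}{3} \\ -\dfrac{1}{9} \end{bmatrix}, \quad Y = \Big[ -\frac{1}{6} \ \ \frac{31}{18} \ \ \frac{7}{18} \Big], \quad \theta = \frac{22}{9}, \quad \theta^{-1} = \frac{9}{22},$$
$$X\theta^{-1}Y = \begin{bmatrix} -\dfrac{1}{33} & \dfrac{31}{99} & \dfrac{7}{99} \\ -\dfrac{1}{22} & \dfrac{31}{66} & \dfrac{7}{66} \\ \dfrac{1}{132} & -\dfrac{31}{396} & -\dfrac{7}{396} \end{bmatrix}$$
Hence
$$S_4^{-1} = A^{-1} = \begin{bmatrix} -\dfrac{5}{44} & \dfrac{15}{44} & \dfrac{19}{44} & -\dfrac{2}{11} \\ \dfrac{9}{44} & \dfrac{17}{44} & \dfrac{1}{44} & -\dfrac{3}{11} \\ \dfrac{4}{44} & \dfrac{10}{44} & -\dfrac{2}{44} & \dfrac{1}{22} \\ \dfrac{3}{44} & -\dfrac{31}{44} & -\dfrac{7}{44} & \dfrac{9}{22} \end{bmatrix} = \frac{1}{44} \begin{bmatrix} -5 & 15 & 19 & -8 \\ 9 & 17 & 1 & -12 \\ 4 & 10 & -2 & 2 \\ 3 & -31 & -7 & 18 \end{bmatrix}$$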


7.13 TRIANGULAR MATRICES 


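Definition. A square matrix is called triangular if the elements
above (below) the main diagonal are zero. For example,
$$T = \begin{bmatrix} t_{11} & t_{12} & \ldots & t_{1n} \\ 0 & t_{22} & \ldots & t_{2n} \\ \ldots & \ldots & \ldots & \ldots \\ 0 & 0 & \ldots & t_{nn} \end{bmatrix}$$
where $t_{ij} = 0$ for $i > j$ is an upper triangular matrix. Similarly,
$$T = \begin{bmatrix} t_{11} & 0 & \ldots & 0 \\ t_{21} & t_{22} & \ldots & 0 \\ \ldots & \ldots & \ldots & \ldots \\ t_{n1} & t_{n2} & \ldots & t_{nn} \end{bmatrix}$$
where $t_{ij} = 0$ for $j > i$ is a lower triangular matrix.

A diagonal matrix is a particular case of both an upper and
lower triangular matrix. The determinant of a triangular matrix
is equal to the product of its diagonal elements, namely, if
$T = [t_{ij}]$ is a triangular matrix, then we obviously have
$\det T = t_{11}t_{22} \ldots t_{nn}$. Therefore, a triangular matrix is
nonsingular only when all its diagonal elements are nonzero.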

It can be proved that (1) the sum and product of triangular
matrices of the same dimensions and the same structure (that is,
upper only or lower only) are also triangular matrices of the same
dimensions and general structure; (2) the inverse of a nonsingular
triangular matrix is also a triangular matrix of the same dimensions
and structure. Utilizing this circumstance, we can readily invert
a triangular matrix.

Example 1. Invert the matrix
$$A = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 2 & 0 \\ 1 & 2 & 3 \end{bmatrix}$$
Solution. Set
$$A^{-1} = \begin{bmatrix} t_{11} & 0 & 0 \\ t_{21} & t_{22} & 0 \\ t_{31} & t_{32} & t_{33} \end{bmatrix}$$
Multiplying the matrices A and $A^{-1}$, we get
$$t_{11} = 1, \qquad t_{11} + 2t_{21} + 3t_{31} = 0,$$
$$t_{11} + 2t_{21} = 0, \qquad 2t_{22} + 3t_{32} = 0,$$
$$2t_{22} = 1, \qquad 3t_{33} = 1$$
whence we successively find
$$t_{11} = 1, \quad t_{21} = -\frac{1}{2}, \quad t_{22} = \frac{1}{2},$$
$$t_{31} = 0, \quad t_{32} = -\frac{1}{3}, \quad t_{33} = \frac{1}{3}$$
Consequently
$$A^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ -\dfrac{1}{2} & \dfrac{1}{2} & 0 \\ 0 & -\dfrac{1}{3} & \dfrac{1}{3} \end{bmatrix}$$

The following important theorem holds true [3].
Theorem. Any square matrix
$$A = \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \ldots & \ldots & \ldots & \ldots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{bmatrix}$$
with nonzero principal-diagonal minors
$$\Delta_1 = a_{11} \neq 0, \quad \Delta_2 = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} \neq 0, \quad \ldots, \quad \Delta_n = |A| \neq 0$$
may be represented as a product of two triangular matrices of
different structure (lower and upper); this expansion will be unique
if the diagonal elements of one of the triangular matrices are fixed
beforehand (say by putting them equal to 1).

We omit the proof but indicate a method for finding the elements
of the desired triangular matrices. Let
$$A = T_1T_2 \qquad (1)$$
where
$$T_1 = [t_{ij}], \quad t_{ij} = 0 \ \text{when} \ i < j \qquad (2)$$
is a lower triangular matrix of order n;
$$T_2 = [r_{ij}], \quad r_{ij} = 0 \ \text{when} \ i > j \qquad (3)$$
is an upper triangular matrix of order n. Multiplying together these
matrices we get, by formula (1),
$$\sum_{k=1}^{n} t_{ik}r_{kj} = a_{ij} \quad (i, j = 1, 2, \ldots, n) \qquad (4)$$
Due to Conditions (2) and (3), system (4) becomes
$$\sum_{k=1}^{j} t_{ik}r_{kj} = a_{ij} \ \text{for} \ i \ge j \quad (j = 1, 2, \ldots, n) \qquad (4')$$
and
$$\sum_{k=1}^{i} t_{ik}r_{kj} = a_{ij} \ \text{for} \ i < j \quad (i = 1, 2, \ldots, n - 1) \qquad (4'')$$
Because of their peculiar structure, the systems (4') and (4'') are
readily solved to within the diagonal elements $t_{ii}$ and $r_{ii}$.
For the sake of definiteness, we can put $r_{ii} = 1$
($i = 1, 2, \ldots, n$).


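Example 2. Represent the matrix
$$A = \begin{bmatrix} 1 & -1 & 2 \\ -1 & 5 & 4 \\ 2 & 4 & 14 \end{bmatrix}$$
as a product of two triangular matrices $T_1$ and $T_2$.
Solution. We seek $T_1$ and $T_2$ such that $A = T_1T_2$ in the form
$$T_1 = \begin{bmatrix} t_{11} & 0 & 0 \\ t_{21} & t_{22} & 0 \\ t_{31} & t_{32} & t_{33} \end{bmatrix} \quad \text{and} \quad T_2 = \begin{bmatrix} 1 & r_{12} & r_{13} \\ 0 & 1 & r_{23} \\ 0 & 0 & 1 \end{bmatrix}$$
We have
$$\begin{bmatrix} 1 & -1 & 2 \\ -1 & 5 & 4 \\ 2 & 4 & 14 \end{bmatrix} = \begin{bmatrix} t_{11} & t_{11}r_{12} & t_{11}r_{13} \\ t_{21} & t_{21}r_{12} + t_{22} & t_{21}r_{13} + t_{22}r_{23} \\ t_{31} & t_{31}r_{12} + t_{32} & t_{31}r_{13} + t_{32}r_{23} + t_{33} \end{bmatrix}$$
whence
$$t_{11} = 1, \quad t_{11}r_{12} = -1, \quad t_{11}r_{13} = 2,$$
$$t_{21} = -1, \quad t_{21}r_{12} + t_{22} = 5, \quad t_{21}r_{13} + t_{22}r_{23} = 4,$$
$$t_{31} = 2, \quad t_{31}r_{12} + t_{32} = 4, \quad t_{31}r_{13} + t_{32}r_{23} + t_{33} = 14$$
Solving the system, we get
$$t_{11} = 1, \quad t_{21} = -1, \quad t_{31} = 2,$$
$$t_{22} = 4, \quad t_{32} = 6, \quad t_{33} = 1,$$
$$r_{12} = -1, \quad r_{13} = 2, \quad r_{23} = \frac{3}{2}$$
Thus
$$T_1 = \begin{bmatrix} 1 & 0 & 0 \\ -1 & 4 & 0 \\ 2 & 6 & 1 \end{bmatrix}$$
and
$$T_2 = \begin{bmatrix} 1 & -1 & 2 \\ 0 & 1 & \dfrac{3}{2} \\ 0 & 0 & 1 \end{bmatrix}$$

Using the representation of a square matrix A ($\det A \neq 0$) as a
product of two triangular matrices, we can indicate another method
for computing the inverse $A^{-1}$; namely, if
$$A = T_1T_2$$
then
$$A^{-1} = T_2^{-1}T_1^{-1}$$
As we have seen, finding the inverses of triangular matrices is
a comparatively simple affair.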
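
The systems (4') and (4'') can be solved column by column and row by
row. The sketch below is one possible arrangement of that
computation, assuming unit diagonal elements in the upper factor as
in Example 2; the function name triangular_factors and the use of
exact fractions are choices made for this illustration.

```python
from fractions import Fraction

def triangular_factors(a):
    """Return (T1, T2) with A = T1 T2, T1 lower, T2 upper with unit diagonal."""
    n = len(a)
    t1 = [[Fraction(0)] * n for _ in range(n)]
    t2 = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for j in range(n):
        # Column j of T1 from (4'): i >= j, and r_jj = 1.
        for i in range(j, n):
            t1[i][j] = Fraction(a[i][j]) - sum(t1[i][k] * t2[k][j]
                                               for k in range(j))
        if t1[j][j] == 0:
            raise ValueError("a principal-diagonal minor is zero")
        # Row j of T2 from (4''): columns c > j.
        for c in range(j + 1, n):
            t2[j][c] = (Fraction(a[j][c])
                        - sum(t1[j][k] * t2[k][c] for k in range(j))) / t1[j][j]
    return t1, t2

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# The matrix of Example 2.
A = [[1, -1, 2],
     [-1, 5, 4],
     [2, 4, 14]]
T1, T2 = triangular_factors(A)
```

For the matrix of Example 2 this returns the factors found above, and
their product gives back A. The routine stops with an error if a
diagonal element of T1 vanishes, which corresponds to a zero
principal-diagonal minor.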


7.14 ELEMENTARY TRANSFORMATIONS OF MATRICES 


The following matrix transformations are called elementary: 

(1) interchanging two rows or columns; 

(2) multiplying all elements of one row (column) by the same 
nonzero number (scalar); 

(3) adding multiples of the elements of a row (column) to the 
elements of another row (column). 

Two matrices are termed equivalent if one is obtained from the
other via a finite number of elementary transformations. Such
matrices are not, generally speaking, equal, but, as may be proved,
they have the same rank [6].

It is easy to see that every elementary transformation of a
square matrix A is equivalent to multiplying A by some nonsingular
matrix. Then, if the transformation operation is performed on rows
(columns) of A, the multiplier must stand on the left (right) of A
and be the result of applying the corresponding elementary
transformation to the unit matrix [6]. For example, if in the matrix
$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$$
we interchange the second and third rows, we get the equivalent
matrix
$$\tilde{A} = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{31} & a_{32} & a_{33} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}$$
The same matrix $\tilde{A}$ is obtained if we interchange the second
and third rows in the unit matrix
$$E = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$
and multiply the resulting matrix
$$\tilde{E} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{bmatrix}$$
on the right by the matrix A, that is, $\tilde{A} = \tilde{E}A$.

Other elementary transformations are carried out in a similar
manner. Note that if in the equation $AA^{-1} = E$ we perform
identical transformations of the rows of the matrices A and E until A
becomes a unit matrix, then we will have
$\tilde{E}AA^{-1} = \tilde{E}$, where $\tilde{E}$ is the transformed unit
matrix. Then, since $\tilde{E}A = E$, we get $A^{-1} = \tilde{E}$, that is
to say, the inverse matrix $A^{-1}$ is a transformed unit matrix. This
is the basis of the method of computing an inverse by means of row
transformations [4].
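
The idea of transforming A and E together can be sketched as follows.
This is an illustrative arrangement, not the book's computational
scheme: the function name inverse_by_row_transformations is
hypothetical, exact fractions are used, and rows are interchanged
whenever a zero turns up in the pivot position.

```python
from fractions import Fraction

def inverse_by_row_transformations(a):
    """Apply the same row transformations to A and E until A becomes E."""
    n = len(a)
    left = [[Fraction(x) for x in row] for row in a]          # becomes E
    right = [[Fraction(int(i == j)) for j in range(n)]        # becomes A^-1
             for i in range(n)]
    for c in range(n):
        pivot = next((r for r in range(c, n) if left[r][c] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        # (1) interchange two rows
        left[c], left[pivot] = left[pivot], left[c]
        right[c], right[pivot] = right[pivot], right[c]
        # (2) multiply a row by a nonzero number
        p = left[c][c]
        left[c] = [x / p for x in left[c]]
        right[c] = [x / p for x in right[c]]
        # (3) add a multiple of one row to another
        for r in range(n):
            if r != c and left[r][c] != 0:
                f = left[r][c]
                left[r] = [x - f * y for x, y in zip(left[r], left[c])]
                right[r] = [x - f * y for x, y in zip(right[r], right[c])]
    return right

# The triangular matrix inverted in Example 1 of Sec. 7.13.
A = [[1, 0, 0],
     [1, 2, 0],
     [1, 2, 3]]
A_inv = inverse_by_row_transformations(A)
```

For this matrix the routine returns the rows (1, 0, 0),
(-1/2, 1/2, 0) and (0, -1/3, 1/3), in agreement with the result
obtained there by solving for the unknown elements.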


7.15 COMPUTATION OF DETERMINANTS 


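Elementary matrix transformations offer the most convenient
method for computing the determinant of a matrix. Suppose, for
example, we have
$$\Delta_n = \begin{vmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \ldots & \ldots & \ldots & \ldots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{vmatrix} \qquad (1)$$
Assuming that $a_{11} \neq 0$, we have
$$\Delta_n = a_{11} \begin{vmatrix} 1 & \dfrac{a_{12}}{a_{11}} & \ldots & \dfrac{a_{1n}}{a_{11}} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \ldots & \ldots & \ldots & \ldots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{vmatrix}$$
Here, subtracting from the elements $a_{ij}$ of the jth column
($j \ge 2$) the corresponding elements of the first column multiplied
by $a_{1j}/a_{11}$, we get
$$\Delta_n = a_{11} \begin{vmatrix} 1 & 0 & \ldots & 0 \\ a_{21} & a_{22}^{(1)} & \ldots & a_{2n}^{(1)} \\ \ldots & \ldots & \ldots & \ldots \\ a_{n1} & a_{n2}^{(1)} & \ldots & a_{nn}^{(1)} \end{vmatrix} = a_{11}\Delta_{n-1}$$
where
$$\Delta_{n-1} = \begin{vmatrix} a_{22}^{(1)} & \ldots & a_{2n}^{(1)} \\ \ldots & \ldots & \ldots \\ a_{n2}^{(1)} & \ldots & a_{nn}^{(1)} \end{vmatrix} \qquad (2)$$
and
$$a_{ij}^{(1)} = a_{ij} - \frac{a_{i1}a_{1j}}{a_{11}} \quad (i, j = 2, \ldots, n)$$

Apply the same technique to the determinant $\Delta_{n-1}$. If all the
elements
$$a_{ii}^{(i-1)} \neq 0 \quad (i = 1, 2, \ldots, n)$$
we finally obtain
$$\Delta_n = a_{11}a_{22}^{(1)} \ldots a_{nn}^{(n-1)} \qquad (3)$$

If in some intermediate determinant $\Delta_{n-k}$ the upper left
element $a_{k+1,k+1}^{(k)} = 0$, then it is necessary to interchange the
rows or columns of the determinant $\Delta_{n-k}$ so that the element we
need is nonzero (this is always possible if the determinant
$\Delta_n \neq 0$). Of course we must take into consideration the change
in the sign of $\Delta_{n-k}$. We can give a more general rule. Suppose
the determinant $\Delta_n = \det[a_{ij}]$ is transformed so that
$a_{pq} = 1$ ($a_{pq}$ is the principal element), that is,
$$\Delta_n = \begin{vmatrix} a_{11} & \ldots & a_{1q} & \ldots & a_{1n} \\ \ldots & \ldots & \ldots & \ldots & \ldots \\ a_{p1} & \ldots & 1 & \ldots & a_{pn} \\ \ldots & \ldots & \ldots & \ldots & \ldots \\ a_{n1} & \ldots & a_{nq} & \ldots & a_{nn} \end{vmatrix}$$
Then
$$\Delta_n = (-1)^{p+q}\Delta_{n-1}$$
where $\Delta_{n-1} = \det[a_{ij}^{(1)}]$ is a determinant of the
(n - 1)th order obtained from $\Delta_n$ by deleting the pth row and the
qth column with a subsequent transformation of the elements via the
formula
$$a_{ij}^{(1)} = a_{ij} - a_{iq}a_{pj}$$
Thus, each element $a_{ij}^{(1)}$ of the determinant $\Delta_{n-1}$ is
equal to the corresponding element $a_{ij}$ of the determinant
$\Delta_n$ diminished by the product of its "projections" $a_{iq}$ and
$a_{pj}$ on the discarded column and row of the original determinant.
The proof of this proposition follows readily from the general
properties of determinants [7].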
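
Formula (3), together with the remark about interchanging rows, can be
sketched in a few lines. This is an illustration only: the function
name determinant is chosen here, exact fractions are used to avoid
rounding, and the elimination is carried out on rows, which yields the
same transformed elements as the column operation described above.

```python
from fractions import Fraction

def determinant(a):
    """Determinant as the product of successive leading elements."""
    n = len(a)
    m = [[Fraction(x) for x in row] for row in a]
    det = Fraction(1)
    for k in range(n):
        # find a nonzero element for the upper left corner
        pivot = next((r for r in range(k, n) if m[r][k] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != k:
            m[k], m[pivot] = m[pivot], m[k]
            det = -det                   # a row interchange changes the sign
        det *= m[k][k]
        for i in range(k + 1, n):
            factor = m[i][k] / m[k][k]
            for j in range(k + 1, n):
                # a_ij^(1) = a_ij - a_i1 a_1j / a_11 for the current block
                m[i][j] -= factor * m[k][j]
    return det

# Illustrative third-order determinant (the matrix of Example 2, Sec. 7.13).
A = [[1, -1, 2],
     [-1, 5, 4],
     [2, 4, 14]]
```

Here determinant(A) equals 4, which agrees with the product of the
diagonal elements 1, 4, 1 of the lower triangular factor found for the
same matrix in Sec. 7.13.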


Example. Evaluate 
3 
2 
A, = 1 
5 
l 


Solution. Taking a= 1 for the principal element, we have 


| —2 —3-3 3 8 1 eye 4-23 
1—3. 4 —4. 2 —(—1). 2. 
As = Coma gels 5 eer —2 —1-(—1) —3 Ba if a = 
—1 —3-2 1 —1-2 2—(—1).2 3 22 
i 0 4—2 
—2 3 3 f 
Tn Rea T 




Now taking a,,=1 for the principal element and applying a 
similar transformation, we obtain 


—15 6 10 —15 3 10 
Ay=(—1)®| 22 —22 —25 |=2| 22 m-11 -——25 |= 
—9 2 7 —9 fi] 7 
12 —I11 
= 2.(—~])3+2 = 
(—1) -77 52 446 








Note that the number of multiplications and divisions required in the computation of an nth-order determinant is [8] equal to

$$\frac{n-1}{3} \, (n^2 + n + 3)$$





REFERENCES FOR CHAPTER 7 


[1] O. Schreier and E. Sperner, Vorlesungen über Matrizen, 1932, Secs. 1, 2.

[2] A. Maltsev, Principles of Linear Algebra, 1956 (in Russian).

[3] V. Faddeyeva, Computational Methods of Linear Algebra, 1950 (in Russian).

[4] R. A. Frazer, W. J. Duncan, A. R. Collar, Elementary Matrices and Some Applications to Dynamics and Differential Equations, 1946.

[5] B. V. Bulgakov, Oscillations, 1954, Chapter I (in Russian).

[6] E. S. Lyapin, Course of Higher Algebra, 1953, Chapter IX (in Russian).

[7] E. T. Whittaker and G. Robinson, The Calculus of Observations, 1944, Chapter V.

[8] D. K. Faddeyev and V. N. Faddeyeva, Computational Methods of Linear Algebra, 1960, Chapter II (in Russian).


Chapter 8 
SOLVING SYSTEMS OF LINEAR EQUATIONS 


8.1 A GENERAL DESCRIPTION OF METHODS 
OF SOLVING SYSTEMS OF LINEAR EQUATIONS 


Methods of solving systems of linear equations are divided mainly 
into two groups: (1) exact methods, which are finite algorithms for 
computing the roots of a system (such, for example, are Cramer's 
rule, the Gaussian method, the method of principal elements, the 
method of square roots, etc.), and (2) iterative methods, which 
permit obtaining the roots of a system to a given accuracy by 
means of convergent infinite processes (these include the method 
of iteration, the Seidel method, the method of relaxation, and others). 

Because of unavoidable rounding, the results of even the exact 
methods are approximate, and error estimates of the roots are, in 
the general case, involved. In the case of iterative processes we 
have the added error of method. 

It will be noted that the effective use of iterative methods de- 
pends essentially on the apt choice of the initial approximation 
and the speed of convergence of the process. 


8.2 SOLUTION BY INVERSION OF MATRICES. 
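CRAMER’S RULE

Suppose we have a system of n linear equations in n unknowns:

$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n &= b_1, \\ a_{21}x_1 + a_{22}x_2 + \dots + a_{2n}x_n &= b_2, \\ \dots \quad & \\ a_{n1}x_1 + a_{n2}x_2 + \dots + a_{nn}x_n &= b_n \end{aligned} \tag{1}$$

Denote the matrix of the coefficients of (1) by

$$A = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \dots & \dots & \dots & \dots \\ a_{n1} & a_{n2} & \dots & a_{nn} \end{bmatrix} \tag{2}$$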




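the column of its constant terms by

$$b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix} \tag{3}$$

and the column of the unknowns (the desired vector) by

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \tag{4}$$

Then the system (1) can be compactly written in the form of a matrix equation:

$$Ax = b \tag{5}$$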


The set of numbers x,, x,,..., x, (or, briefly, the vector x) which 
reduce (1) to an identity is called the solution set of the system, 
and the numbers x; are termed the roots of the system. 

If the matrix $A$ is nonsingular, that is,

$$\det A = \begin{vmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \dots & \dots & \dots & \dots \\ a_{n1} & a_{n2} & \dots & a_{nn} \end{vmatrix} = \Delta \neq 0 \tag{6}$$

then (1), or the matrix equation (5) equivalent to it, has a unique solution.

Indeed, provided $\det A \neq 0$, there is an inverse matrix $A^{-1}$. Premultiplying both members of (5) by the matrix $A^{-1}$, we obtain

$$A^{-1}Ax = A^{-1}b$$

or

$$x = A^{-1}b \tag{7}$$

Formula (7) plainly yields a solution of (5) and since every solution is of the form (7), the solution is unique.


Example 1. Solve the system of equations 


$$\begin{aligned} 3x_1 - x_2 &= 5, \\ -2x_1 + x_2 + x_3 &= 0, \\ 2x_1 - x_2 + 4x_3 &= 15 \end{aligned}$$




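Solution. Write the system in matrix form:

$$\begin{bmatrix} 3 & -1 & 0 \\ -2 & 1 & 1 \\ 2 & -1 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 5 \\ 0 \\ 15 \end{bmatrix}$$

The determinant of matrix $A$ of the given system is

$$\Delta = \begin{vmatrix} 3 & -1 & 0 \\ -2 & 1 & 1 \\ 2 & -1 & 4 \end{vmatrix} = 5 \neq 0$$

Computing the inverse matrix $A^{-1}$, we obtain

$$A^{-1} = \begin{bmatrix} 1 & \frac{4}{5} & -\frac{1}{5} \\ 2 & \frac{12}{5} & -\frac{3}{5} \\ 0 & \frac{1}{5} & \frac{1}{5} \end{bmatrix}$$

whence

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 1 & \frac{4}{5} & -\frac{1}{5} \\ 2 & \frac{12}{5} & -\frac{3}{5} \\ 0 & \frac{1}{5} & \frac{1}{5} \end{bmatrix} \begin{bmatrix} 5 \\ 0 \\ 15 \end{bmatrix} = \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix}$$

Thus, $x_1 = 2$, $x_2 = 1$, $x_3 = 3$.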
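A great deal of time is required to find the inverse $A^{-1}$ of a matrix $A$ of order $n > 4$ directly. For this reason, formula (7) is rarely used for practical purposes.

Using formula (7), it is easy to obtain formulas for the unknowns of the system (1). As we know (see Sec. 7.4),

$$A^{-1} = \frac{1}{\Delta} \tilde{A}$$

where

$$\tilde{A} = \begin{bmatrix} A_{11} & A_{21} & \dots & A_{n1} \\ A_{12} & A_{22} & \dots & A_{n2} \\ \dots & \dots & \dots & \dots \\ A_{1n} & A_{2n} & \dots & A_{nn} \end{bmatrix}$$

is the adjoint of $A$ ($A_{ij}$ are the cofactors of the elements $a_{ij}$). Therefore

$$x = \frac{1}{\Delta} \tilde{A} b$$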




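or

$$\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \frac{1}{\Delta} \begin{bmatrix} \Delta_1 \\ \Delta_2 \\ \vdots \\ \Delta_n \end{bmatrix} \tag{8}$$

where

$$\Delta_i = \sum_{j=1}^{n} A_{ji} b_j = \begin{vmatrix} a_{11} & \dots & a_{1,i-1} & b_1 & a_{1,i+1} & \dots & a_{1n} \\ a_{21} & \dots & a_{2,i-1} & b_2 & a_{2,i+1} & \dots & a_{2n} \\ \dots & \dots & \dots & \dots & \dots & \dots & \dots \\ a_{n1} & \dots & a_{n,i-1} & b_n & a_{n,i+1} & \dots & a_{nn} \end{vmatrix} \quad (i = 1, 2, \dots, n)$$

are the determinants obtained from the determinant $\Delta$ [formula (6)] by replacing its $i$th column by the column of constant terms of system (1). From equation (8) we get Cramer’s formulas:

$$x_1 = \frac{\Delta_1}{\Delta}, \quad x_2 = \frac{\Delta_2}{\Delta}, \quad \dots, \quad x_n = \frac{\Delta_n}{\Delta} \tag{9}$$

Thus, if the determinant of system (1) is nonzero, $\Delta \neq 0$, then the system has a unique solution $x$ defined by the matrix formula (7) or by the scalar formulas (9) equivalent to it.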
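As an illustration of formulas (9), the following minimal Python sketch applies Cramer's formulas to the system of Example 1. The helper names `det` and `cramer` are hypothetical, and the determinants are evaluated by naive cofactor expansion, which is acceptable only for very small n.

```python
from fractions import Fraction


def det(m):
    """Determinant by cofactor expansion along the first row (small n only)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total


def cramer(a, b):
    """Solve a x = b by formulas (9); a is a list of integer rows, b a list."""
    delta = det(a)
    if delta == 0:
        raise ValueError("determinant of the system is zero")
    xs = []
    for i in range(len(a)):
        # Delta_i: replace the i-th column of the matrix by the constant terms.
        a_i = [row[:i] + [b[k]] + row[i + 1:] for k, row in enumerate(a)]
        xs.append(Fraction(det(a_i), delta))
    return xs


print(cramer([[3, -1, 0], [-2, 1, 1], [2, -1, 4]], [5, 0, 15]))
```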


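Example 2. Solve the system of linear equations

$$\begin{aligned} 2x_1 + x_2 - 5x_3 + x_4 &= 8, \\ x_1 - 3x_2 - 6x_4 &= 9, \\ 2x_2 - x_3 + 2x_4 &= -5, \\ x_1 + 4x_2 - 7x_3 + 6x_4 &= 0 \end{aligned}$$

Solution. The determinant of this system is

$$\Delta = \begin{vmatrix} 2 & 1 & -5 & 1 \\ 1 & -3 & 0 & -6 \\ 0 & 2 & -1 & 2 \\ 1 & 4 & -7 & 6 \end{vmatrix} = 27$$

Computing the supplementary determinants, we get

$$\Delta_1 = \begin{vmatrix} 8 & 1 & -5 & 1 \\ 9 & -3 & 0 & -6 \\ -5 & 2 & -1 & 2 \\ 0 & 4 & -7 & 6 \end{vmatrix} = 81, \qquad \Delta_2 = \begin{vmatrix} 2 & 8 & -5 & 1 \\ 1 & 9 & 0 & -6 \\ 0 & -5 & -1 & 2 \\ 1 & 0 & -7 & 6 \end{vmatrix} = -108,$$

$$\Delta_3 = \begin{vmatrix} 2 & 1 & 8 & 1 \\ 1 & -3 & 9 & -6 \\ 0 & 2 & -5 & 2 \\ 1 & 4 & 0 & 6 \end{vmatrix} = -27, \qquad \Delta_4 = \begin{vmatrix} 2 & 1 & -5 & 8 \\ 1 & -3 & 0 & 9 \\ 0 & 2 & -1 & -5 \\ 1 & 4 & -7 & 0 \end{vmatrix} = 27$$

whence

$$x_1 = \frac{\Delta_1}{\Delta} = \frac{81}{27} = 3, \quad x_2 = \frac{\Delta_2}{\Delta} = \frac{-108}{27} = -4, \quad x_3 = \frac{\Delta_3}{\Delta} = \frac{-27}{27} = -1, \quad x_4 = \frac{\Delta_4}{\Delta} = \frac{27}{27} = 1$$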


Thus, the solution of a linear system (1) in n unknowns reduces to evaluating n + 1 determinants of order n. If n is great, evaluating the determinants is a laborious operation. For this reason, direct techniques have been elaborated for finding the roots of a linear system of equations.


8.3 THE GAUSSIAN METHOD 


The most common technique for the solution of systems of linear 
equations is via an algorithm for the successive elimination of the 
unknowns. This method is called the Gaussian method. For the sake of simplicity, we confine ourselves to a system of four equations in four unknowns:

$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + a_{14}x_4 &= a_{15}, \\ a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + a_{24}x_4 &= a_{25}, \\ a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + a_{34}x_4 &= a_{35}, \\ a_{41}x_1 + a_{42}x_2 + a_{43}x_3 + a_{44}x_4 &= a_{45} \end{aligned} \tag{1}$$




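Let the leading element $a_{11} \neq 0$. Dividing the coefficients of the first equation of (1) by $a_{11}$, we get

$$x_1 + b_{12}x_2 + b_{13}x_3 + b_{14}x_4 = b_{15} \tag{2}$$

where

$$b_{1j} = \frac{a_{1j}}{a_{11}} \quad (j > 1)$$

Using equation (2), it is easy to eliminate the unknown $x_1$ from the system (1). To do this, subtract (2) multiplied by $a_{21}$ from the second equation of (1), subtract (2) multiplied by $a_{31}$ from the third equation of (1), and so on. We finally get a system consisting of three equations:

$$\begin{aligned} a_{22}^{(1)}x_2 + a_{23}^{(1)}x_3 + a_{24}^{(1)}x_4 &= a_{25}^{(1)}, \\ a_{32}^{(1)}x_2 + a_{33}^{(1)}x_3 + a_{34}^{(1)}x_4 &= a_{35}^{(1)}, \\ a_{42}^{(1)}x_2 + a_{43}^{(1)}x_3 + a_{44}^{(1)}x_4 &= a_{45}^{(1)} \end{aligned} \tag{1'}$$

where the coefficients $a_{ij}^{(1)}$ ($i, j \geq 2$) are computed from the formula

$$a_{ij}^{(1)} = a_{ij} - a_{i1} b_{1j} \quad (i, j \geq 2)$$

Now, dividing the coefficients of the first equation of (1') by the leading element $a_{22}^{(1)}$, we get the equation

$$x_2 + b_{23}^{(1)}x_3 + b_{24}^{(1)}x_4 = b_{25}^{(1)} \tag{2'}$$

where

$$b_{2j}^{(1)} = \frac{a_{2j}^{(1)}}{a_{22}^{(1)}} \quad (j > 2)$$

Eliminating $x_2$ in the same way that we eliminated $x_1$, we arrive at the following system of equations:

$$\begin{aligned} a_{33}^{(2)}x_3 + a_{34}^{(2)}x_4 &= a_{35}^{(2)}, \\ a_{43}^{(2)}x_3 + a_{44}^{(2)}x_4 &= a_{45}^{(2)} \end{aligned} \tag{1''}$$

where

$$a_{ij}^{(2)} = a_{ij}^{(1)} - a_{i2}^{(1)} b_{2j}^{(1)} \quad (i, j \geq 3)$$

Dividing the coefficients of the first equation of (1'') by the leading element $a_{33}^{(2)}$, we get

$$x_3 + b_{34}^{(2)}x_4 = b_{35}^{(2)} \tag{2''}$$

where

$$b_{3j}^{(2)} = \frac{a_{3j}^{(2)}}{a_{33}^{(2)}} \quad (j > 3)$$

Now, eliminating $x_3$ in the same fashion from (1''), we get

$$a_{44}^{(3)}x_4 = a_{45}^{(3)} \tag{1'''}$$

where

$$a_{ij}^{(3)} = a_{ij}^{(2)} - a_{i3}^{(2)} b_{3j}^{(2)} \quad (i, j \geq 4)$$

whence

$$x_4 = \frac{a_{45}^{(3)}}{a_{44}^{(3)}} = b_{45}^{(3)} \tag{2'''}$$

The remaining unknowns are successively determined from the equations (2''), (2') and (2):

$$\begin{aligned} x_3 &= b_{35}^{(2)} - b_{34}^{(2)}x_4, \\ x_2 &= b_{25}^{(1)} - b_{24}^{(1)}x_4 - b_{23}^{(1)}x_3, \\ x_1 &= b_{15} - b_{14}x_4 - b_{13}x_3 - b_{12}x_2 \end{aligned}$$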


Thus, the process of solving a linear system of equations by the Gaussian method reduces to the construction of an equivalent system (2), (2'), (2''), (2''') having a triangular matrix. A necessary and sufficient condition for using this method is that all the leading elements be nonzero. The computations can be conveniently arranged as shown in Table 13. The scheme in the table is called the scheme of unique division. The process of finding the coefficients $b_{ij}^{(i-1)}$ of a triangular system may be called forward substitution (direct procedure), the process of finding the values of the unknowns is called back substitution (reverse procedure).
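The direct and reverse procedures can be sketched as follows for a system of any order n. This is an illustrative sketch, not the tabular scheme itself: the function name `gauss_unique_division` is hypothetical, no rows are interchanged (so every leading element is assumed nonzero, as the text requires), and ordinary floating-point arithmetic replaces the fixed five-decimal rounding of a hand computation.

```python
def gauss_unique_division(a, b):
    """Solve a x = b by the scheme of unique division (no row interchanges)."""
    n = len(a)
    # Current section of the scheme: coefficients followed by the constant term.
    section = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    marked = []                      # the rows "containing units"
    for _ in range(n):
        lead = section[0][0]
        if lead == 0:
            raise ZeroDivisionError("zero leading element")
        top = [v / lead for v in section[0]]      # marked row: 1, b_k,k+1, ...
        marked.append(top)
        # Next section: a_ij - a_ik * b_kj; the first row and column drop out.
        section = [[row[j] - row[0] * top[j] for j in range(1, len(row))]
                   for row in section[1:]]
    # Reverse procedure: use the marked rows, beginning with the last.
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        row = marked[k]              # [1, coefficients of x_{k+1..n-1}, constant]
        x[k] = row[-1] - sum(row[j] * x[k + j] for j in range(1, n - k))
    return x


print(gauss_unique_division([[2, 1], [1, 3]], [3, 5]))
```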

The direct procedure begins with writing out the coefficients of the system, including the constant terms (Section A). The last row of Section A is the result of dividing the first row of the section by the leading element $a_{11}$. The elements $a_{ij}^{(1)}$ ($i, j \geq 2$) of the next section of the scheme (Section $A_1$) are equal to the corresponding elements $a_{ij}$ of the preceding section minus the product of their "projections" by the lines of Section A containing element 1 (that is to say, by the first column and last row).

The last row of Section $A_1$ is found by means of dividing the first row of the section by the leading element $a_{22}^{(1)}$. The subsequent sections are constructed in similar fashion. The direct procedure is terminated when we reach the section consisting of one row, not counting the transformed row (Section $A_3$ in our particular case).

In back substitution (the reverse procedure), use is made only of the rows of sections $A_i$ containing units (marked rows), beginning with the last. The element $b_{45}^{(3)}$ of Section $A_3$ in the column of constant terms of the marked row of the section yields the value of $x_4$. From then on, all the other unknowns $x_i$ ($i = 3, 2, 1$) are found step by step by means of subtracting from the constant term



TABLE 13
SCHEME OF UNIQUE DIVISION





of the marked row the sums of the products of its coefficients by the corresponding values of the earlier found unknowns. The values of the unknowns are written out in succession in the last section, Section B. The units in this section help to locate for $x_i$ the respective coefficients in the marked rows.




The computations are checked by so-called "check sums"

$$a_{i6} = \sum_{j=1}^{5} a_{ij} \quad (i = 1, 2, 3, 4) \tag{3}$$

which are located in the column labelled $\Sigma$ and are the sums of the elements of the rows of the matrix of the original system (1) including the constant terms.

If $a_{i6}$ is taken as the new constant terms in system (1), then the transformed linear system

$$\sum_{j=1}^{4} a_{ij} \bar{x}_j = a_{i6} \quad (i = 1, 2, 3, 4) \tag{4}$$

will have the unknowns $\bar{x}_j$ connected with the earlier unknowns $x_j$ by the relations

$$\bar{x}_j = x_j + 1 \quad (j = 1, 2, 3, 4) \tag{5}$$

Indeed, substituting formulas (5) into equation (4), we get, by virtue of the system (1) and formulas (3), the identity

$$\sum_{j=1}^{4} a_{ij} x_j + \sum_{j=1}^{4} a_{ij} = \sum_{j=1}^{5} a_{ij} = a_{i6} \quad (i = 1, 2, 3, 4)$$

Generally, if we perform the same operations with the check sums in each row as with the remaining elements of that row, then in the absence of errors in the computations, the elements of the column headed $\Sigma$ will be equal to the sums of the elements of the corresponding transformed rows. This serves as a check on the direct procedure. The reverse procedure is checked by finding the numbers $\bar{x}_j$, which must coincide with the numbers $x_j + 1$.
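The check-sum idea is easy to try numerically. In the sketch below (hypothetical function name `solve_with_check`), each row is extended by its constant term and by the check sum (3), and both extra columns are carried through the same row operations. For brevity the sketch uses complete (Gauss-Jordan) elimination rather than the sectioned scheme, which is an assumption of this illustration; the relation $\bar{x}_j = x_j + 1$ holds either way.

```python
def solve_with_check(a, b):
    """Return (x, x_bar): the roots and the roots obtained from the check sums."""
    n = len(a)
    m = [list(map(float, row)) + [float(bi)] for row, bi in zip(a, b)]
    for row in m:
        row.append(sum(row))         # check sum: coefficients plus constant term
    for k in range(n):
        lead = m[k][k]               # assumed nonzero (no row interchanges)
        m[k] = [v / lead for v in m[k]]
        for i in range(n):
            if i != k:
                f = m[i][k]
                m[i] = [vi - f * vk for vi, vk in zip(m[i], m[k])]
    x = [row[n] for row in m]            # transformed constant terms
    x_bar = [row[n + 1] for row in m]    # transformed check sums
    return x, x_bar


x, x_bar = solve_with_check([[2, 1], [1, 3]], [3, 5])
print(x, x_bar)
```

A discrepancy between `x_bar[j]` and `x[j] + 1` well beyond rounding level would point to an arithmetic slip somewhere in the elimination.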


Example. Solve the system 


$$\begin{aligned} 7.9x_1 + 5.6x_2 + 5.7x_3 - 7.2x_4 &= 6.68, \\ 8.5x_1 - 4.8x_2 + 0.8x_3 + 3.5x_4 &= 9.95, \\ 4.3x_1 + 4.2x_2 - 3.2x_3 + 9.3x_4 &= 8.6, \\ 3.2x_1 - 1.4x_2 - 8.9x_3 + 3.3x_4 &= 1 \end{aligned} \tag{6}$$

Solution. In Section A of Table 14 write down the matrix of the coefficients of the system, its constant terms and the check sums. Then fill in the last (fifth) row of Section A, dividing the first row by 7.9 (by $a_{11}$).




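TABLE 14
SOLUTION OF A SYSTEM OF EQUATIONS BY THE SCHEME OF UNIQUE DIVISION

Section   x_1     x_2         x_3         x_4         Constant terms   Σ
A         7.9     5.6         5.7         -7.2        6.68             18.68
          8.5     -4.8        0.8         3.5         9.95             17.95
          4.3     4.2         -3.2        9.3         8.6              23.2
          3.2     -1.4        -8.9        3.3         1                -2.8
          1       0.70886     0.72152     -0.91139    0.84557          2.36456
A_1               -10.82531   -5.33292    11.24682    2.76265          -2.14876
                  1.15190     -6.30254    13.21898    4.96405          13.03239
                  -3.66835    -11.20886   6.21645     -1.70582         -10.36658
                  1           0.49263     -1.03894    -0.25520         0.19849
A_2                           -6.87000    14.41573    5.25801          …
                              -9.40172    2.40525     …                …
                              1           -2.09836    -0.76536         …
A_3                                       …           …                …
                                          1           0.56790          …
B                                         1           0.56790          …
                              1                       0.42630          …
                  1                                   0.12480          …
          1                                           0.96710          …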




Then start filling in Section $A_1$ of the table. Taking any element of Section A (not in the first row), subtract from it the product of the first element of its row by the last element of the column it belongs to, and record it in the appropriate place in Section $A_1$ of the scheme. For instance, choosing $a_{43} = -8.9$, we have

$$a_{43}^{(1)} = a_{43} - a_{41} b_{13} = -8.9 - 3.2 \cdot 0.72152 = -11.20886$$

To obtain the last row of Section $A_1$, divide all terms of the first row of that section by $a_{22}^{(1)} = -10.82531$. For example,

$$b_{23}^{(1)} = \frac{a_{23}^{(1)}}{a_{22}^{(1)}} = \frac{-5.33292}{-10.82531} = 0.49263$$

All the remaining sections of the table are filled in similarly. For instance,

$$a_{44}^{(2)} = a_{44}^{(1)} - a_{42}^{(1)} b_{24}^{(1)} = 6.21645 - (-3.66835) \cdot (-1.03894) = 2.40525$$


The unknowns are found by using the rows containing units, beginning with the last one (marked rows). The unknown $x_4$ is the constant term of the last row of Section $A_3$:

$$x_4 = b_{45}^{(3)} = 0.56790$$

The values of the other unknowns $x_3$, $x_2$, $x_1$ are obtained in succession by subtracting from the constant terms of the marked rows the sums of the products of the corresponding coefficients $b_{ij}^{(i-1)}$ by the earlier found values of the unknowns.

We have

$$\begin{aligned} x_3 &= b_{35}^{(2)} - b_{34}^{(2)}x_4 = -0.76536 - (-2.09836) \cdot 0.56790 = 0.42630, \\ x_2 &= b_{25}^{(1)} - b_{24}^{(1)}x_4 - b_{23}^{(1)}x_3 = -0.25520 - (-1.03894) \cdot 0.56790 - 0.49263 \cdot 0.42630 = 0.12480, \\ x_1 &= b_{15} - b_{14}x_4 - b_{13}x_3 - b_{12}x_2 = 0.84557 - (-0.91139) \cdot 0.56790 - 0.72152 \cdot 0.42630 - 0.70886 \cdot 0.12480 = 0.96710 \end{aligned}$$

thus,

$$x_1 = 0.96710, \quad x_2 = 0.12480, \quad x_3 = 0.42630, \quad x_4 = 0.56790$$


An intermediate check of the computations is carried out by means of the $\Sigma$ column, on which are performed all the operations performed on the other columns.

Thus, (1) the sum of the elements of each row of the scheme (the elements not belonging to the $\Sigma$ column) must be equal to the element of that row of the $\Sigma$ column; (2) the roots $\bar{x}_i$ that correspond to the $\Sigma$ column must be greater by unity than the corresponding roots of the system.

Incidentally, if we take into account the units written in Section B, then again the elements of the $\Sigma$ column in this section are sums of the elements of the rows corresponding to them. In our case, the first and second conditions are valid to within unity of the last digit place. It is therefore almost definite that the computations were performed correctly.

Note that if the matrix of the system is symmetric, the corresponding parts of the sections A, $A_1$, $A_2$, ... of the scheme of unique division are also symmetric. This circumstance can be utilized to simplify the table.

It is easy to estimate the number of arithmetic operations, N, necessary to solve a system of linear equations in n unknowns by the Gaussian method [5] (not counting those for checking).

The direct procedure requires the following number of multiplications and divisions:

$$n(n+1) + (n-1)n + \dots + 1 \cdot 2 = (1^2 + 2^2 + \dots + n^2) + (1 + 2 + \dots + n) = \frac{n(n+1)(n+2)}{3}$$

and as many subtractions. The reverse procedure requires $\frac{n(n-1)}{2}$ multiplications and divisions and the same number of subtractions. Hence, the total number of arithmetic operations in the Gaussian method is

$$N = \frac{2n(n+1)(n+2)}{3} + n(n-1) < n^3$$

for $n > 9$.

Thus, the time required for the solution of a linear system by the Gaussian method is roughly proportional to the cube of the number of unknowns. For example, to solve a system of 100 linear equations in 100 unknowns by the Gaussian method on a computer capable of $10^4$ operations per second requires

$$T \approx 10^6 \cdot 10^{-4} = 100 \text{ seconds}$$


The actual machine time will be considerably greater because 
of other operations in the routine besides arithmetic operations 
(address substitution, logical operations, sending, shaping, etc.). 
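The operation count is simple enough to tabulate directly. The sketch below (hypothetical function name `gauss_operation_count`) sums the per-step counts of the direct procedure, adds the reverse procedure, and compares the total with $n^3$; it assumes the counting convention stated above (as many subtractions as multiplications and divisions).

```python
def gauss_operation_count(n):
    """Total arithmetic operations of the Gaussian method for n unknowns."""
    # Direct procedure: n(n+1) + (n-1)n + ... + 1*2 multiplications and
    # divisions, and as many subtractions.
    forward = sum(k * (k + 1) for k in range(1, n + 1))
    # Reverse procedure: n(n-1)/2 multiplications and as many subtractions.
    backward = n * (n - 1) // 2
    return 2 * forward + 2 * backward


for n in (4, 9, 10, 100):
    print(n, gauss_operation_count(n), n ** 3)
```

For n = 100 the total is about 0.7 million operations, of the same order as the rough estimate $n^3 = 10^6$ used for the timing above.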


8.4 IMPROVING ROOTS 


Approximate values of roots obtained by the Gaussian method 
can be improved. We will show how this is done if the correc- 
tions to the roots are small in absolute value. 

Suppose an approximate solution $x_0$ is found for the system

$$Ax = b$$

Setting

$$x = x_0 + \delta$$

we then get, for the correction $\delta = \begin{bmatrix} \delta_1 \\ \vdots \\ \delta_n \end{bmatrix}$ of the root $x_0$,

$$A(x_0 + \delta) = b$$

or

$$A\delta = \varepsilon$$

where $\varepsilon = b - Ax_0$ is the residual of the approximate solution $x_0$. Thus, in order to find $\delta$, it is necessary to solve a linear system with the earlier matrix $A$ and a new constant term $\varepsilon = \begin{bmatrix} \varepsilon_1 \\ \vdots \\ \varepsilon_n \end{bmatrix}$. To do this, all we need is to adjoin to the main computational scheme a column $\varepsilon$ of constant terms and transform it by the general rules. As usual, the corrections $\delta_1, \delta_2, \dots, \delta_n$ are found from the marked rows, the coefficients of these unknown corrections being already given in the table. Note that the transformed coefficients of matrix $A$ need not be improved since for small residuals the corresponding errors will have a higher order of smallness.
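One refinement step can be sketched as follows. The names `solve` and `refine` are hypothetical, and for simplicity the sketch repeats the elimination for the residual column instead of reusing the already transformed coefficients, as the tabular scheme would.

```python
def solve(a, b):
    """Gaussian elimination without row interchanges (nonzero leading elements assumed)."""
    n = len(a)
    m = [list(map(float, row)) + [float(bi)] for row, bi in zip(a, b)]
    for k in range(n):
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            m[i] = [vi - f * vk for vi, vk in zip(m[i], m[k])]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def refine(a, b, x0):
    """One improvement step: solve A*delta = eps with eps = b - A*x0."""
    eps = [bi - sum(aij * xj for aij, xj in zip(row, x0)) for row, bi in zip(a, b)]
    delta = solve(a, eps)
    return [xj + dj for xj, dj in zip(x0, delta)], eps, delta


a = [[6, -1, -1], [-1, 6, -1], [-1, -1, 6]]
b = [11.33, 32, 42]
improved, eps, delta = refine(a, b, [4.67, 7.62, 9.05])
print(improved, eps, delta)
```

In hand computation the benefit comes from computing the residual to more places than the original three-digit solution; in floating point the same step is the basis of what is usually called iterative refinement.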


Example. Solve the following system of equations by the Gaussian method to three places (say, by slide rule or hand):

$$\begin{aligned} 6x_1 - x_2 - x_3 &= 11.33, \\ -x_1 + 6x_2 - x_3 &= 32, \\ -x_1 - x_2 + 6x_3 &= 42 \end{aligned} \tag{1}$$

Using the values thus obtained as initial approximations, improve the roots to $10^{-4}$.

Solution. Use the ordinary scheme of unique division (Table 15) and carry out all operations with three significant digits. The approximate values of the roots are:

$$x_1^{(0)} = 4.67, \quad x_2^{(0)} = 7.62, \quad x_3^{(0)} = 9.05$$

Substituting these values in the given system (1), we compute the appropriate residuals [that is, the differences between the right and left members of system (1)]:

$$\varepsilon_1^{(0)} = -0.02, \quad \varepsilon_2^{(0)} = 0, \quad \varepsilon_3^{(0)} = -0.01$$

Using these values as constant terms (Table 15), we obtain the corrections of the roots:

$$\delta_1^{(0)} = -0.0039, \quad \delta_2^{(0)} = -0.0011, \quad \delta_3^{(0)} = -0.0025$$

whence we get the improved values of the roots:

$$x_1 = 4.6661, \quad x_2 = 7.6189, \quad x_3 = 9.0475$$

the residuals being equal to

$$\varepsilon_1 = -2 \cdot 10^{-4}, \quad \varepsilon_2 = 2 \cdot 10^{-4}, \quad \varepsilon_3 = 0$$




TABLE 15
REFINING ROOTS COMPUTED BY THE GAUSSIAN METHOD





It is sometimes required to determine a possible error $\Delta x$ of the root $x$ of a linear system on the basis of known small errors $\Delta A$ and $\Delta b$ of the matrix $A$ of the system and its constant term $b$.

We have

$$Ax = b \tag{2}$$

and

$$(A + \Delta A)(x + \Delta x) = b + \Delta b \tag{3}$$

From this, neglecting the small term $\Delta A \cdot \Delta x$, we obtain

$$Ax + A \, \Delta x + \Delta A \, x = b + \Delta b$$

or

$$A \, \Delta x = \Delta b - \Delta A \, x \tag{4}$$

It is thus possible, when seeking $\Delta x$ approximately, to use the Gaussian scheme for the basic system (2) by augmenting the scheme with a new column of constant terms, $\Delta b - \Delta A \, x$.


8.5 THE METHOD OF PRINCIPAL ELEMENTS 


Suppose we have a linear system 


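$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n &= a_{1,n+1}, \\ a_{21}x_1 + a_{22}x_2 + \dots + a_{2n}x_n &= a_{2,n+1}, \\ \dots \quad & \\ a_{n1}x_1 + a_{n2}x_2 + \dots + a_{nn}x_n &= a_{n,n+1} \end{aligned} \tag{1}$$

Consider the augmented rectangular matrix consisting of the coefficients of the system and its constant terms,

$$M = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1j} & \dots & a_{1q} & \dots & a_{1n} & a_{1,n+1} \\ a_{21} & a_{22} & \dots & a_{2j} & \dots & a_{2q} & \dots & a_{2n} & a_{2,n+1} \\ \dots & \dots & \dots & \dots & \dots & \dots & \dots & \dots & \dots \\ a_{i1} & a_{i2} & \dots & a_{ij} & \dots & a_{iq} & \dots & a_{in} & a_{i,n+1} \\ \dots & \dots & \dots & \dots & \dots & \dots & \dots & \dots & \dots \\ a_{p1} & a_{p2} & \dots & a_{pj} & \dots & a_{pq} & \dots & a_{pn} & a_{p,n+1} \\ \dots & \dots & \dots & \dots & \dots & \dots & \dots & \dots & \dots \\ a_{n1} & a_{n2} & \dots & a_{nj} & \dots & a_{nq} & \dots & a_{nn} & a_{n,n+1} \end{bmatrix}$$

Choose a nonzero (as a rule, the numerically largest) element $a_{pq}$ of matrix $M$ not belonging to the column of constant terms ($q \neq n+1$), this element being called the principal element. Compute the multipliers

$$m_i = -\frac{a_{iq}}{a_{pq}}$$

for all $i \neq p$.

The row of $M$ with index $p$ which contains the principal element is called the principal row. Then perform the following operation: to each nonprincipal row add the principal row multiplied by the appropriate multiplier $m_i$ of the row. We thus obtain a new matrix in which the $q$th column consists of zeros. Discarding this column and the principal $p$th row, we obtain a new matrix $M^{(1)}$ with the number of rows and columns diminished by unity.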

Repeat these operations with matrix $M^{(1)}$ to get matrix $M^{(2)}$, etc. Thus, we obtain a sequence of matrices

$$M, \; M^{(1)}, \; \dots, \; M^{(n-1)}$$

the last of which is a two-term row matrix. It is also regarded as the principal row.

To determine the unknowns $x_i$, combine into a system all the principal rows, beginning with the last, which enters into matrix $M^{(n-1)}$.

After an appropriate change in the numbering of the unknowns 
we get a system with a triangular matrix, from which it is easy 
to obtain, step by step, the unknowns of the given system (1). 
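The method of principal elements is always applicable if the determinant of the system

$$\det A = \begin{vmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \dots & \dots & \dots & \dots \\ a_{n1} & a_{n2} & \dots & a_{nn} \end{vmatrix} \neq 0$$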








Note that the Gaussian method is a particular case of the method 
of principal elements and the Gaussian scheme is obtained if for 
the principal element we always choose the upper left element of 
the corresponding matrix. 
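The method of principal elements can be sketched as below. The function name `solve_principal_elements` is hypothetical. Instead of physically discarding the principal row and column, the sketch keeps lists of the rows and columns still in play, which amounts to the same thing; the principal element is taken to be the numerically largest remaining coefficient.

```python
def solve_principal_elements(a, b):
    """Solve a x = b, choosing the numerically largest element as principal."""
    n = len(a)
    m = [list(map(float, row)) + [float(bi)] for row, bi in zip(a, b)]
    rows, cols = list(range(n)), list(range(n))   # rows/columns not yet discarded
    principal = []                                # (p, q) of each principal row
    while rows:
        p, q = max(((i, j) for i in rows for j in cols),
                   key=lambda ij: abs(m[ij[0]][ij[1]]))
        if m[p][q] == 0:
            raise ValueError("matrix is singular")
        for i in rows:
            if i != p:
                mi = -m[i][q] / m[p][q]           # multiplier m_i
                m[i] = [vi + mi * vp for vi, vp in zip(m[i], m[p])]
        principal.append((p, q))
        rows.remove(p)
        cols.remove(q)
    # Combine the principal rows, beginning with the last one.
    x = [0.0] * n
    for p, q in reversed(principal):
        s = sum(m[p][j] * x[j] for j in range(n) if j != q)
        x[q] = (m[p][n] - s) / m[p][q]
    return x


print(solve_principal_elements([[0, 2], [1, 1]], [4, 3]))
```

Note that the first example system has a zero in the upper left corner, so the plain Gaussian scheme would fail on it without a reordering, while the choice of a principal element handles it directly.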


8.6 USE OF THE GAUSSIAN METHOD 
IN COMPUTING DETERMINANTS 


Suppose

$$A = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ \dots & \dots & \dots & \dots \\ a_{n1} & a_{n2} & \dots & a_{nn} \end{bmatrix} \tag{1}$$

and

$$\Delta = \det A \tag{2}$$

Consider the linear system

$$Ax = 0 \tag{3}$$

When solving (3) by the Gaussian method we replaced matrix $A$ by the triangular matrix $B$ consisting of elements of marked rows,

$$B = \begin{bmatrix} 1 & b_{12} & \dots & b_{1n} \\ 0 & 1 & \dots & b_{2n}^{(1)} \\ \dots & \dots & \dots & \dots \\ 0 & 0 & \dots & 1 \end{bmatrix}$$

We obtained an equivalent system:

$$Bx = 0 \tag{4}$$

The elements of $B$ were successively obtained from the elements of $A$ and the subsequent auxiliary matrices $A_1, A_2, \dots, A_{n-1}$ with the aid of the following elementary transformations:

(1) division by the leading elements $a_{11}, a_{22}^{(1)}, \dots, a_{nn}^{(n-1)}$, which were assumed nonzero, and

(2) subtraction, from the rows of $A$ and the intermediate matrices $A_i$ ($i = 1, 2, \dots, n-1$), of numbers proportional to the elements of the corresponding leading rows.

In the first operation, the determinant of the matrix is also divided by the appropriate leading element; in the second, the determinant of the matrix remains unchanged. Therefore

$$\det B = 1 = \frac{\det A}{a_{11} a_{22}^{(1)} \dots a_{nn}^{(n-1)}}$$

Hence

$$\Delta = \det A = a_{11} a_{22}^{(1)} \dots a_{nn}^{(n-1)} \tag{5}$$


that is, the determinant is equal to the product of the leading ele- 
ments of the corresponding Gaussian scheme. We conclude from this 
that the scheme of unique division given in Sec. 8.3 may be used 
for computing determinants (in this case the column of constant 
terms is superfluous). 

Note that if for any step the element $a_{ii}^{(i-1)} = 0$ or is close to zero (this implies a reduction in the accuracy of the computations), one should appropriately change the order of the rows and columns of the matrix.
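Formula (5), together with the remark on reordering, can be sketched as follows. The function name `det_gauss` is hypothetical, and the reordering rule used here (bring the numerically largest element of the current column into the leading position, changing the sign for each interchange) is one common choice, not something the text prescribes.

```python
def det_gauss(a):
    """Determinant as the product of the leading elements of the Gaussian scheme."""
    m = [list(map(float, row)) for row in a]
    n = len(m)
    det = 1.0
    for k in range(n):
        # Assumed reordering rule: largest element of column k becomes the leader.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[p][k] == 0.0:
            return 0.0
        if p != k:
            m[k], m[p] = m[p], m[k]
            det = -det               # a row interchange changes the sign
        det *= m[k][k]
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n):
                m[i][j] -= f * m[k][j]
    return det


print(det_gauss([[2, 1], [1, 3]]))
```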


Example. Evaluate the determinant

$$\Delta = \begin{vmatrix} 7.4 & 2.2 & -3.1 & 0.7 \\ 1.6 & 4.8 & -8.5 & 4.5 \\ 4.7 & 7.0 & -6.0 & 6.6 \\ 5.9 & 2.7 & 4.9 & -5.3 \end{vmatrix}$$

Solution. Using the elements of the determinant $\Delta$, form a unique-division scheme (Table 16). Multiplying together the leading elements (in frames), we get

$$\Delta = 7.4 \cdot 4.32434 \cdot 6.11331 \cdot (-7.58393) = -1483.61867$$


It is worth noting the following. To solve a system of n linear equations in n unknowns by Cramer’s formulas, one has to evaluate n + 1 determinants of the nth order. Now by the unique-division scheme, to compute one determinant of the nth order requires nearly the same volume of work as the complete solution of the system of equations. It is therefore, generally speaking, not advisable to use Cramer’s rule for a numerical solution of a linear system of equations for n > 3.




TABLE 16
EVALUATING A DETERMINANT BY THE GAUSSIAN METHOD

Section   1st column   2nd column   3rd column   4th column   Σ
A         [7.4]        2.2          -3.1         0.7          7.2
          1.6          4.8          -8.5         4.5          2.4
          4.7          7.0          -6.0         6.6          12.3
          5.9          2.7          4.9          -5.3         8.2
          1            0.29729      -0.41891     0.09459      0.97297
A_1                    [4.32434]    -7.82974     4.34866      0.84326
                       5.60274      -4.03112     6.15543      7.72705
                       0.94599      7.37157      -5.85808     2.45948
                       1            -1.81062     1.00562      0.19500
A_2                                 [6.11331]    0.52120      6.63451
                                    9.08440      -6.80939     2.27501
                                    1            0.08526      1.08526
A_3                                              [-7.58393]   -7.58393

Δ = -1483.61867

(Leading elements are shown in square brackets.)


8.7 INVERSION OF MATRICES BY THE GAUSSIAN METHOD 
Suppose we have a nonsingular matrix

$$A = [a_{ij}] \quad (i, j = 1, 2, \dots, n) \tag{1}$$

To find its inverse

$$A^{-1} = [x_{ij}] \tag{2}$$

we use the basic relation

$$AA^{-1} = E \tag{3}$$

where $E$ is the unit matrix. Multiplying matrices $A$ and $A^{-1}$, we get n systems of equations in $n^2$ unknowns $x_{ij}$:

$$\sum_{k=1}^{n} a_{ik} x_{kj} = \delta_{ij} \quad (i, j = 1, 2, \dots, n)$$

where

$$\delta_{ij} = \begin{cases} 1 & \text{when } i = j, \\ 0 & \text{when } i \neq j \end{cases}$$

The resulting n systems of linear equations for $j = 1, 2, \dots, n$, having the same matrix $A$ and distinct constant terms, can be solved simultaneously by the Gaussian method.
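Solving the n systems simultaneously means carrying all n columns of the unit matrix through a single elimination. A minimal sketch follows; the function name `invert` is hypothetical, and row interchanges (bringing the largest element of the column to the leading position) are an added safeguard that the scheme in the text does not use.

```python
def invert(a):
    """Inverse matrix by one Gaussian elimination on the augmented array [A | E]."""
    n = len(a)
    m = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(a)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]          # same interchange on both halves
        lead = m[k][k]
        m[k] = [v / lead for v in m[k]]  # marked row
        for i in range(k + 1, n):
            f = m[i][k]
            m[i] = [vi - f * vk for vi, vk in zip(m[i], m[k])]
    # Back substitution, one column of constant terms at a time.
    inv = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n - 1, -1, -1):
            inv[i][j] = m[i][n + j] - sum(m[i][k] * inv[k][j] for k in range(i + 1, n))
    return inv


print(invert([[3, -1, 0], [-2, 1, 1], [2, -1, 4]]))
```

As in the tabular scheme, the back substitution produces the last row of the inverse first.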


Example. Find the inverse $A^{-1}$ of the matrix

$$A = \begin{bmatrix} 1.8 & -3.8 & 0.7 & -3.7 \\ 0.7 & 2.1 & -2.6 & -2.8 \\ 7.3 & 8.1 & 1.7 & -4.9 \\ 1.9 & -4.3 & -4.9 & -4.7 \end{bmatrix}$$

Solution. Form a unique-division scheme. We will have four columns of constant terms (Table 17). Note that the elements of the rows of the inverse matrix are obtained in reverse order.

On the basis of the results of Table 17, we get

$$A^{-1} = \begin{bmatrix} -0.21121 & -0.46003 & 0.16284 & 0.26956 \\ -0.03533 & 0.16873 & 0.01573 & -0.08920 \\ 0.23030 & 0.04607 & -0.00944 & -0.19885 \\ -0.29316 & -0.38837 & 0.06128 & 0.18513 \end{bmatrix}$$

To check, form the product

$$AA^{-1} = \begin{bmatrix} 1.8 & -3.8 & 0.7 & -3.7 \\ 0.7 & 2.1 & -2.6 & -2.8 \\ 7.3 & 8.1 & 1.7 & -4.9 \\ 1.9 & -4.3 & -4.9 & -4.7 \end{bmatrix} \begin{bmatrix} -0.21121 & -0.46003 & 0.16284 & 0.26956 \\ -0.03533 & 0.16873 & 0.01573 & -0.08920 \\ 0.23030 & 0.04607 & -0.00944 & -0.19885 \\ -0.29316 & -0.38837 & 0.06128 & 0.18513 \end{bmatrix}$$


TABLE 17
COMPUTING THE INVERSE MATRIX BY THE GAUSSIAN METHOD



$$AA^{-1} = \begin{bmatrix} 0.99997 & 0.00000 & -0.00001 & 0.00000 \\ -0.00025 & 0.99997 & -0.00002 & -0.00039 \\ -0.00808 & -0.01017 & 0.99982 & 0.00009 \\ 0.00000 & 0.00000 & 0.00000 & 1.00048 \end{bmatrix} = E - 10^{-3} \cdot \begin{bmatrix} 0.03 & 0.00 & 0.01 & 0.00 \\ 0.25 & 0.03 & 0.02 & 0.39 \\ 8.08 & 10.17 & 0.18 & -0.09 \\ 0.00 & 0.00 & 0.00 & -0.48 \end{bmatrix}$$

We see that due to rounding the inverse is not quite exact. Below we give (see Sec. 8.15) a method for correcting the elements of an approximate inverse matrix.


8.8 SQUARE-ROOT METHOD 


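Suppose we have a linear system

$$Ax = b \tag{1}$$

where $A = [a_{ij}]$ is a symmetric matrix, that is, $A' = [a_{ji}] = A$. Then $A$ may be given in the form of a product of two mutually transposed triangular matrices:

$$A = T'T \tag{2}$$

where

$$T = \begin{bmatrix} t_{11} & t_{12} & \dots & t_{1n} \\ 0 & t_{22} & \dots & t_{2n} \\ \dots & \dots & \dots & \dots \\ 0 & 0 & \dots & t_{nn} \end{bmatrix}, \qquad T' = \begin{bmatrix} t_{11} & 0 & \dots & 0 \\ t_{12} & t_{22} & \dots & 0 \\ \dots & \dots & \dots & \dots \\ t_{1n} & t_{2n} & \dots & t_{nn} \end{bmatrix}$$

Multiplying together the matrices $T'$ and $T$, we obtain the following equations which allow us to determine the elements $t_{ij}$ of matrix $T$:

$$\begin{aligned} t_{1i}t_{1j} + t_{2i}t_{2j} + \dots + t_{ii}t_{ij} &= a_{ij} \quad (i < j), \\ t_{1i}^2 + t_{2i}^2 + \dots + t_{ii}^2 &= a_{ii} \end{aligned}$$

Whence we find in succession:

$$\begin{aligned} t_{11} &= \sqrt{a_{11}}, \qquad t_{1j} = \frac{a_{1j}}{t_{11}} \quad (j > 1), \\ t_{ii} &= \sqrt{a_{ii} - \sum_{k=1}^{i-1} t_{ki}^2} \quad (1 < i \leq n), \\ t_{ij} &= \frac{a_{ij} - \sum_{k=1}^{i-1} t_{ki}t_{kj}}{t_{ii}} \quad (i < j), \\ t_{ij} &= 0 \quad \text{for } i > j \end{aligned} \tag{3}$$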




The system (1) has a definite unique solution if $t_{ii} \neq 0$ ($i = 1, 2, \dots, n$), since in that case

$$\det A = \det T' \cdot \det T = (\det T)^2 = (t_{11} t_{22} \dots t_{nn})^2 \neq 0$$

The coefficients of matrix $T$ will be real if $t_{ii}^2 > 0$. In the sequel we will not, generally speaking, assume this latter condition to be fulfilled.

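Given relation (2), the equation (1) is equivalent to two equations:

$$T'y = b \quad \text{and} \quad Tx = y$$

or, expanded,

$$\begin{aligned} t_{11}y_1 &= b_1, \\ t_{12}y_1 + t_{22}y_2 &= b_2, \\ \dots \quad & \\ t_{1n}y_1 + t_{2n}y_2 + \dots + t_{nn}y_n &= b_n \end{aligned} \tag{4}$$

and

$$\begin{aligned} t_{11}x_1 + t_{12}x_2 + \dots + t_{1n}x_n &= y_1, \\ t_{22}x_2 + \dots + t_{2n}x_n &= y_2, \\ \dots \quad & \\ t_{nn}x_n &= y_n \end{aligned} \tag{5}$$

From this we successively obtain

$$y_1 = \frac{b_1}{t_{11}}, \qquad y_i = \frac{b_i - \sum_{k=1}^{i-1} t_{ki}y_k}{t_{ii}} \quad (i > 1) \tag{6}$$

and

$$x_n = \frac{y_n}{t_{nn}}, \qquad x_i = \frac{y_i - \sum_{k=i+1}^{n} t_{ik}x_k}{t_{ii}} \quad (i < n) \tag{7}$$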

The foregoing method of solving a system of linear equations is called the square-root method. Since matrix $A$ is symmetric and matrix $T$ is upper triangular, we need only write the $\frac{n(n+1)}{2}$ upper coefficients $a_{ij}$ and $t_{ij}$ ($i \leq j$) in the computational scheme. The checking procedure is the ordinary type with the aid of sums; all coefficients of the appropriate row are taken into account when forming a sum.




Note that if for some $s$th row we have $t_{ss}^2 < 0$, then the corresponding elements $t_{sj}$ will be imaginary. Formally, the method is applicable in this case as well.

In practical applications of the square-root method, the direct procedure, by means of formulas (3) and (6), is used to compute successively the coefficients $t_{ij}$ and $y_i$ ($i = 1, 2, \dots, n$) and then the reverse procedure, by formula (7), is used to find the unknowns $x_i$ ($i = n, n-1, \dots, 1$).
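Formulas (3), (6) and (7) translate almost line by line into code. In the sketch below the function name `square_root_solve` is hypothetical, and Python's `cmath.sqrt` is used so that a negative $t_{ii}^2$ simply yields an imaginary $t_{ii}$, in keeping with the remark that the method remains formally applicable in that case.

```python
import cmath


def square_root_solve(a, b):
    """Solve a x = b for symmetric a via a = T'T; returns (x, T)."""
    n = len(a)
    t = [[0j] * n for _ in range(n)]
    y = [0j] * n
    for i in range(n):
        # Formulas (3): diagonal element, then the rest of row i of T.
        t[i][i] = cmath.sqrt(a[i][i] - sum(t[k][i] ** 2 for k in range(i)))
        if t[i][i] == 0:
            raise ZeroDivisionError("t_ii = 0: the method is not applicable")
        for j in range(i + 1, n):
            t[i][j] = (a[i][j] - sum(t[k][i] * t[k][j] for k in range(i))) / t[i][i]
        # Formula (6): new constant term y_i.
        y[i] = (b[i] - sum(t[k][i] * y[k] for k in range(i))) / t[i][i]
    # Formula (7): reverse procedure.
    x = [0j] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(t[i][k] * x[k] for k in range(i + 1, n))) / t[i][i]
    # For real a and b the imaginary parts of x vanish (up to rounding).
    return [v.real for v in x], t


x, t = square_root_solve([[4, 2], [2, 3]], [10, 8])
print(x)
```

Note that the products use plain squares $t_{ki}^2$, not squared moduli, so $T'$ is the ordinary transpose of $T$ even when some elements are imaginary.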

Example. Using the square-root method, solve the following system 
of equations: 

$$\begin{aligned} x_1 + 3x_2 - 2x_3 - 2x_5 &= 0.5, \\ 3x_1 + 4x_2 - 5x_3 + x_4 - 3x_5 &= 5.4, \\ -2x_1 - 5x_2 + 3x_3 - 2x_4 + 2x_5 &= 5.0, \\ x_2 - 2x_3 + 5x_4 + 3x_5 &= 7.5, \\ -2x_1 - 3x_2 + 2x_3 + 3x_4 + 4x_5 &= 3.3 \end{aligned}$$

Solution. Enter the coefficients $a_{ij}$ and the constant terms $b_i$ of the given system in the initial section, A, of the table (Table 18)

TABLE 18
SOLUTION OF A LINEAR SYSTEM BY THE SQUARE-ROOT METHOD






and compute the column marked $\Sigma$. Using formulas (3) and (6) and moving from row to row in succession, compute the coefficients $t_{ij}$ and the new constant terms $y_i$, thus filling Section B of the table.

For example,

$$t_{35} = \frac{a_{35} - t_{13}t_{15} - t_{23}t_{25}}{t_{33}} = \frac{2 - (-2)(-2) - (-0.4472i)(-1.3416i)}{0.8944i} = 1.5653i$$

Compute the column labelled $\Sigma$ for a check. On the basis of formulas (7) we find the values of the unknowns $x_i$ and the checking values $\bar{x}_i = x_i + 1$, entering them in Section C. For example,

$$x_3 = \frac{y_3 - t_{35}x_5 - t_{34}x_4}{t_{33}} = \frac{-7.5803i - 1.5653i \cdot 0.1998 - 2.0125i \cdot (-0.8996)}{0.8944i} = -6.8011$$


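8.9 THE SCHEME OF KHALETSKY

For the sake of convenience, we write the system of linear equations in matrix notation as

$$Ax = b \tag{1}$$

where $A = [a_{ij}]$ is a square matrix of order n and

$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \qquad b = \begin{bmatrix} a_{1,n+1} \\ \vdots \\ a_{n,n+1} \end{bmatrix}$$

are column vectors. Represent matrix $A$ in the form of a product of a lower triangular matrix $B = [b_{ij}]$ and an upper triangular matrix $C = [c_{ij}]$ with unit diagonal, thus,

$$A = BC \tag{2}$$

where

$$B = \begin{bmatrix} b_{11} & 0 & \dots & 0 \\ b_{21} & b_{22} & \dots & 0 \\ \dots & \dots & \dots & \dots \\ b_{n1} & b_{n2} & \dots & b_{nn} \end{bmatrix}, \qquad C = \begin{bmatrix} 1 & c_{12} & \dots & c_{1n} \\ 0 & 1 & \dots & c_{2n} \\ \dots & \dots & \dots & \dots \\ 0 & 0 & \dots & 1 \end{bmatrix}$$

Then the elements $b_{ij}$ and $c_{ij}$ are determined from the formulas

$$b_{i1} = a_{i1}, \qquad b_{ij} = a_{ij} - \sum_{k=1}^{j-1} b_{ik}c_{kj} \quad (i \geq j > 1) \tag{3}$$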




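and

$$c_{1j} = \frac{a_{1j}}{b_{11}}, \qquad c_{ij} = \frac{1}{b_{ii}} \left( a_{ij} - \sum_{k=1}^{i-1} b_{ik}c_{kj} \right) \quad (1 < i < j) \tag{4}$$

Whence the desired vector $x$ may be computed from the chain of equations

$$By = b, \qquad Cx = y \tag{5}$$

Systems (5) are readily solved since the matrices $B$ and $C$ are triangular:

$$y_1 = \frac{a_{1,n+1}}{b_{11}}, \qquad y_i = \frac{1}{b_{ii}} \left( a_{i,n+1} - \sum_{k=1}^{i-1} b_{ik}y_k \right) \quad (i > 1) \tag{6}$$

and

$$x_n = y_n, \qquad x_i = y_i - \sum_{k=i+1}^{n} c_{ik}x_k \quad (i < n) \tag{7}$$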


From formulas (6) it is evident that the numbers $y_i$ are advantageously computed together with the coefficients $c_{ij}$. This method is known as the scheme of Khaletsky. This scheme uses the ordinary type of checking by means of sums.

Note that if the matrix $A$ is symmetric, that is $a_{ij} = a_{ji}$, then

$$c_{ij} = \frac{b_{ji}}{b_{ii}} \quad (i < j)$$


Khaletsky’s scheme is convenient for machine computation since 
in this case the operations of “accumulation” (3) and (4) may be 
carried out without recording the intermediate results. 
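
The staircase order of formulas (3), (4), (6) and (7) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `khaletsky_solve` and the list-of-lists matrix layout are assumptions made here, not part of the text, and no pivoting or check-sum column is included.

```python
def khaletsky_solve(a, rhs):
    """Solve a x = rhs via a = B C (B lower triangular, C unit upper triangular).

    Hypothetical helper for illustration; it fails if some b_jj turns out to be zero.
    """
    n = len(a)
    b = [[0.0] * n for _ in range(n)]
    c = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    y = [0.0] * n
    for j in range(n):
        # Column j of B, formula (3): b_ij = a_ij - sum_{k<j} b_ik c_kj for i >= j.
        for i in range(j, n):
            b[i][j] = a[i][j] - sum(b[i][k] * c[k][j] for k in range(j))
        if b[j][j] == 0.0:
            raise ZeroDivisionError("zero diagonal element b_jj")
        # Row j of C, formula (4): c_jm = (a_jm - sum_{k<j} b_jk c_km) / b_jj for m > j.
        for m in range(j + 1, n):
            c[j][m] = (a[j][m] - sum(b[j][k] * c[k][m] for k in range(j))) / b[j][j]
        # y_j from formula (6), computed together with row j of C.
        y[j] = (rhs[j] - sum(b[j][k] * y[k] for k in range(j))) / b[j][j]
    # Back substitution, formula (7).
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = y[i] - sum(c[i][k] * x[k] for k in range(i + 1, n))
    return b, c, y, x


# The 4x4 system of the worked example in this section.
A = [[3.0, 1.0, -1.0, 2.0],
     [-5.0, 1.0, 3.0, -4.0],
     [2.0, 0.0, 1.0, -1.0],
     [1.0, -5.0, 3.0, -3.0]]
rhs = [6.0, -12.0, 1.0, 3.0]
B, C, y, x = khaletsky_solve(A, rhs)
print(x)
```

Each inner sum is accumulated in one pass, which mirrors the remark that the intermediate products need not be recorded.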


Example. Solve the system 
$$\begin{aligned} 3x_1 + x_2 - x_3 + 2x_4 &= 6, \\ -5x_1 + x_2 + 3x_3 - 4x_4 &= -12, \\ 2x_1 + x_3 - x_4 &= 1, \\ x_1 - 5x_2 + 3x_3 - 3x_4 &= 3 \end{aligned}$$





Solution (see Table 19). 


TABLE 19










In the first section of Table 19 enter the matrix of coefficients 
of the system, the constant terms and the check sums. 

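Then, since $b_{i1} = a_{i1}$ $(i = 1, 2, 3, 4)$, the first column of Section I is moved to the first column of Section II.

To obtain the first row of Section II, divide all elements of the first row of Section I by the element $a_{11} = b_{11}$, by 3 in our case.

We have

$$c_{12} = \frac{1}{3} = 0.(3), \quad c_{13} = -\frac{1}{3} = -0.(3), \quad c_{14} = \frac{2}{3} = 0.(6), \quad c_{15} = \frac{6}{3} = 2, \quad c_{16} = \frac{11}{3} = 3.(6)$$

Now fill in the second column of Section II beginning with the second row. Using formulas (3), determine $b_{i2}$:

$$\begin{aligned} b_{22} &= a_{22} - b_{21}c_{12} = 1 - (-5)\cdot\frac{1}{3} = \frac{8}{3} = 2.(6), \\ b_{32} &= a_{32} - b_{31}c_{12} = 0 - 2\cdot\frac{1}{3} = -\frac{2}{3} = -0.(6), \\ b_{42} &= a_{42} - b_{41}c_{12} = -5 - 1\cdot\frac{1}{3} = -\frac{16}{3} = -5.(3) \end{aligned}$$

Then, determining $c_{2j}$ $(j = 3, 4, 5, 6)$ by formulas (4), fill in the second row of Section II:

$$\begin{aligned} c_{23} &= \frac{1}{b_{22}}(a_{23} - b_{21}c_{13}) = \frac{3}{8}\Bigl[3 - (-5)\cdot\Bigl(-\frac{1}{3}\Bigr)\Bigr] = \frac{1}{2}, \\ c_{24} &= \frac{1}{b_{22}}(a_{24} - b_{21}c_{14}) = \frac{3}{8}\Bigl[(-4) - (-5)\cdot\frac{2}{3}\Bigr] = -\frac{1}{4}, \\ c_{25} &= \frac{1}{b_{22}}(a_{25} - b_{21}c_{15}) = \frac{3}{8}\bigl[(-12) - (-5)\cdot 2\bigr] = -\frac{3}{4}, \\ c_{26} &= \frac{1}{b_{22}}(a_{26} - b_{21}c_{16}) = \frac{3}{8}\Bigl[(-17) - (-5)\cdot\frac{11}{3}\Bigr] = \frac{1}{2} \end{aligned}$$

We now go to the third column and compute its elements $b_{33}$ and $b_{43}$ from formulas (3), and so on until we have filled in the whole of Section II. We thus get a staircase arrangement in Section II: column-row, column-row, etc.

In Section III, we determine $y_i$ and $x_i$ $(i = 1, 2, 3, 4)$ using formulas (6) and (7).

Intermediate checking is done by means of the $\Sigma$ column, which is involved in the same operations as is the column of constant terms.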


8.10 THE METHOD OF ITERATION 


When a linear system has a large number of unknowns, the Gauss- 
ian scheme, which yields an exact solution, becomes very un- 
wieldy. Under such conditions it is sometimes more convenient to 
use approximate numerical methods for finding the roots of the 
system. One of these methods is the method of iteration, which we 
present here. 

Given a linear system 


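$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \ldots + a_{1n}x_n &= b_1, \\ a_{21}x_1 + a_{22}x_2 + \ldots + a_{2n}x_n &= b_2, \\ \ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots & \\ a_{n1}x_1 + a_{n2}x_2 + \ldots + a_{nn}x_n &= b_n \end{aligned} \tag{1}$$

Introducing the matrices

$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}, \qquad x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \qquad b = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}$$

we can write system (1) briefly as a matrix equation:

$$Ax = b \tag{1'}$$

Assuming that the diagonal coefficients

$$a_{ii} \ne 0 \quad (i = 1, 2, \ldots, n)$$

we solve the first equation of (1) for $x_1$, the second for $x_2$ and so on. We then get the equivalent system

$$\begin{aligned} x_1 &= \beta_1 + \alpha_{12}x_2 + \alpha_{13}x_3 + \ldots + \alpha_{1n}x_n, \\ x_2 &= \beta_2 + \alpha_{21}x_1 + \alpha_{23}x_3 + \ldots + \alpha_{2n}x_n, \\ &\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots \\ x_n &= \beta_n + \alpha_{n1}x_1 + \alpha_{n2}x_2 + \ldots + \alpha_{n,\,n-1}x_{n-1} \end{aligned} \tag{2}$$

where

$$\beta_i = \frac{b_i}{a_{ii}}, \qquad \alpha_{ij} = -\frac{a_{ij}}{a_{ii}} \quad \text{for } i \ne j$$

and $\alpha_{ij} = 0$ for $i = j$ $(i, j = 1, 2, \ldots, n)$.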




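Introducing the matrices

$$\alpha = \begin{bmatrix} \alpha_{11} & \alpha_{12} & \cdots & \alpha_{1n} \\ \alpha_{21} & \alpha_{22} & \cdots & \alpha_{2n} \\ \vdots & \vdots & & \vdots \\ \alpha_{n1} & \alpha_{n2} & \cdots & \alpha_{nn} \end{bmatrix} \quad \text{and} \quad \beta = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_n \end{bmatrix}$$

we can write (2) in matrix form:

$$x = \beta + \alpha x \tag{2'}$$

We will solve system (2') by the method of successive approximations. For the zeroth approximation we take, say, the column of constant terms $x^{(0)} = \beta$.

Then we consecutively construct the column matrices

$$x^{(1)} = \beta + \alpha x^{(0)}$$

(first approximation),

$$x^{(2)} = \beta + \alpha x^{(1)}$$

(second approximation), and so forth.

Generally speaking, any $(k+1)$th approximation is computed from the formula

$$x^{(k+1)} = \beta + \alpha x^{(k)} \quad (k = 0, 1, 2, \ldots) \tag{3}$$

If the sequence of approximations $x^{(0)}, x^{(1)}, \ldots, x^{(k)}, \ldots$ has a limit

$$x = \lim_{k \to \infty} x^{(k)}$$

then this limit is the solution of system (2). Indeed, passing to the limit in (3), we have

$$\lim_{k \to \infty} x^{(k+1)} = \beta + \alpha \lim_{k \to \infty} x^{(k)}$$

or

$$x = \beta + \alpha x$$

which is to say that the limiting vector $x$ is the solution of system (2') and, consequently, of system (1).

Let us write out the formulas of the approximations in full:

$$x_i^{(0)} = \beta_i, \qquad x_i^{(k+1)} = \beta_i + \sum_{j=1}^{n} \alpha_{ij}x_j^{(k)} \tag{3'}$$

$$(\alpha_{ii} = 0;\ i = 1, \ldots, n;\ k = 0, 1, 2, \ldots)$$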


It will be noted that it is sometimes more advantageous to reduce system (1) to (2) so that the coefficients $\alpha_{ii}$ are not equal to zero. For instance, take the equation

$$1.02x_1 - 0.15x_2 = 2.7$$

In order to apply the method of successive approximations it is natural to write it in the form

$$x_1 = 2.7 - 0.02x_1 + 0.15x_2$$

Generally, having a system

$$\sum_{j=1}^{n} a_{ij}x_j = b_i \quad (i = 1, 2, \ldots, n)$$

we can put

$$a_{ii} = a_{ii}^{(1)} + a_{ii}^{(2)}$$

where $a_{ii}^{(1)} \ne 0$. Then the given system is equivalent to the reduced system

$$x_i = \beta_i + \sum_{j=1}^{n} \alpha_{ij}x_j \quad (i = 1, 2, \ldots, n)$$

where

$$\beta_i = \frac{b_i}{a_{ii}^{(1)}}, \qquad \alpha_{ii} = -\frac{a_{ii}^{(2)}}{a_{ii}^{(1)}}, \qquad \alpha_{ij} = -\frac{a_{ij}}{a_{ii}^{(1)}} \quad \text{for } i \ne j$$

For this reason, from now on we will not, generally speaking, assume that $\alpha_{ii} = 0$.

The method of successive approximations given by the formula 
(3) or (3’) is called the method of iteration. The process of itera- 
tion (3) converges rapidly, that is, the number of approximations 
necessary to obtain the roots of (1) to a given accuracy is small 
if the elements of the matrix a are small in absolute value. In 
other words, successful use of the iteration process requires that
the moduli of the diagonal coefficients of system (1) be large in 
comparison with the moduli of the nondiagonal coefficients of this 
system (here the constant terms are immaterial). 
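
A minimal sketch of the reduction to form (2) and of one step of formula (3) follows. The helper names `reduce_system`, `iteration_step` and `iterate`, the stopping rule on the largest change between approximations, and the tolerance are assumptions made for illustration; the data are those of Example 1 of this section.

```python
def reduce_system(a, b):
    """Return (alpha, beta) of x = beta + alpha x, assuming every a[i][i] != 0."""
    n = len(a)
    alpha = [[0.0 if i == j else -a[i][j] / a[i][i] for j in range(n)] for i in range(n)]
    beta = [b[i] / a[i][i] for i in range(n)]
    return alpha, beta


def iteration_step(alpha, beta, x):
    """One application of formula (3): x_new = beta + alpha x."""
    n = len(x)
    return [beta[i] + sum(alpha[i][j] * x[j] for j in range(n)) for i in range(n)]


def iterate(alpha, beta, x0=None, tol=1e-10, max_steps=1000):
    """Repeat formula (3) until two successive approximations agree to within tol."""
    x = list(beta) if x0 is None else list(x0)
    for step in range(1, max_steps + 1):
        x_new = iteration_step(alpha, beta, x)
        if max(abs(p - q) for p, q in zip(x_new, x)) < tol:
            return x_new, step
        x = x_new
    raise RuntimeError("no convergence within max_steps")


# System (4) of Example 1.
A = [[4.0, 0.24, -0.08], [0.09, 3.0, -0.15], [0.04, -0.08, 4.0]]
b = [8.0, 9.0, 20.0]
alpha, beta = reduce_system(A, b)
print(iteration_step(alpha, beta, beta))   # first approximation
print(iterate(alpha, beta))
```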


Example 1. Solve the system

$$\begin{aligned} 4x_1 + 0.24x_2 - 0.08x_3 &= 8, \\ 0.09x_1 + 3x_2 - 0.15x_3 &= 9, \\ 0.04x_1 - 0.08x_2 + 4x_3 &= 20 \end{aligned} \tag{4}$$

by the method of iteration.


Solution. Here, the diagonal coefficients 4, 3, 4 of the system 
considerably exceed the remaining coefficients of the unknowns. 




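Reduce this system to the normal form (2),

$$\begin{aligned} x_1 &= 2 - 0.06x_2 + 0.02x_3, \\ x_2 &= 3 - 0.03x_1 + 0.05x_3, \\ x_3 &= 5 - 0.01x_1 + 0.02x_2 \end{aligned} \tag{5}$$

System (5) can be written in matrix form as

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} 2 \\ 3 \\ 5 \end{bmatrix} + \begin{bmatrix} 0 & -0.06 & 0.02 \\ -0.03 & 0 & 0.05 \\ -0.01 & 0.02 & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}$$

For the zeroth approximations of the roots of (4) we take

$$x_1^{(0)} = 2, \quad x_2^{(0)} = 3, \quad x_3^{(0)} = 5$$

Substituting these values into the right members of (5), we get the first approximations of the roots:

$$\begin{aligned} x_1^{(1)} &= 2 - 0.06 \cdot 3 + 0.02 \cdot 5 = 1.92, \\ x_2^{(1)} &= 3 - 0.03 \cdot 2 + 0.05 \cdot 5 = 3.19, \\ x_3^{(1)} &= 5 - 0.01 \cdot 2 + 0.02 \cdot 3 = 5.04 \end{aligned}$$

Substituting these approximations into (5), we get the second approximations of the roots:

$$x_1^{(2)} = 1.9094, \quad x_2^{(2)} = 3.1944, \quad x_3^{(2)} = 5.0446$$

Substituting again, we get the third approximations of the roots:

$$x_1^{(3)} = 1.90923, \quad x_2^{(3)} = 3.19495, \quad x_3^{(3)} = 5.04485, \text{ etc.}$$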


The results of the computations are entered in Table 20. 


TABLE 20
SOLVING A LINEAR SYSTEM BY THE METHOD OF ITERATION

  k     x_1        x_2        x_3
  0     2          3          5
  1     1.92       3.19       5.04
  2     1.9094     3.1944     5.0446
  3     1.90923    3.19495    5.04485





Note. When using the method of iteration [formula (3)], it is not necessary to take the column of constant terms as the zeroth approximation $x^{(0)}$. As will be shown below, the convergence of the iteration process depends solely on the properties of the matrix $\alpha$. Note that when certain conditions are met, if this process converges for a certain choice of the initial approximation, it will converge to the same limiting vector for any other choice of the initial approximation as well. For this reason, in the iteration process the initial vector $x^{(0)}$ can be chosen arbitrarily. It is advisable, for the components of the initial vector, to take the approximate values of the roots of the system found as a reasonable guess.

A converging iteration process is self-correcting; that is, an 
individual computational error will not affect the final result, 
since any erroneous approximation may be regarded as a new 
initial vector. 

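Also, it is sometimes more convenient to compute not the approximations as such but their differences. Introducing the notations

$$\Delta^{(k)} = x^{(k)} - x^{(k-1)} \quad (k = 1, 2, \ldots)$$

we have, from formula (3),

$$x^{(k+1)} = \beta + \alpha x^{(k)} \tag{6}$$

and

$$x^{(k)} = \beta + \alpha x^{(k-1)} \tag{7}$$

Whence subtract (7) from (6) to get

$$\Delta^{(k+1)} = \alpha\,(x^{(k)} - x^{(k-1)}) = \alpha\Delta^{(k)}$$

or

$$\Delta^{(k+1)} = \alpha\Delta^{(k)} \quad (k = 1, 2, \ldots) \tag{8}$$

For the zeroth approximation we take

$$\Delta^{(0)} = x^{(0)} \tag{9}$$

Then the $m$th approximation is

$$x^{(m)} = \sum_{k=0}^{m} \Delta^{(k)} \tag{10}$$

If, as usual, we put $\Delta^{(0)} = x^{(0)} = \beta$, then (8) will hold true for $k = 0$ as well; otherwise (8) does not hold for $k = 0$. From this we obtain the following procedure for computations based on this version of iteration:

(1) if $\Delta^{(0)} = x^{(0)} = \beta$, then

$$\Delta^{(k+1)} = \alpha\Delta^{(k)} = \alpha^{k+1}\beta \quad (k = 0, 1, 2, \ldots)$$

and

$$x^{(k)} = \sum_{s=0}^{k} \Delta^{(s)} = \sum_{s=0}^{k} \alpha^{s}\beta$$

(2) but if $\Delta^{(0)} = x^{(0)} \ne \beta$, then we find

$$\Delta^{(1)} = x^{(1)} - x^{(0)} = \alpha x^{(0)} + \beta - x^{(0)}$$

and assume

$$\Delta^{(k+1)} = \alpha\Delta^{(k)} = \alpha^{k}\Delta^{(1)} \quad (k = 1, 2, 3, \ldots)$$

Thus

$$x^{(k)} = \sum_{s=0}^{k} \Delta^{(s)} = x^{(0)} + \sum_{s=1}^{k} \alpha^{s-1}\Delta^{(1)}$$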


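Example 2. Solve the system

$$\begin{aligned} 2x_1 - x_2 + x_3 &= -3, \\ 3x_1 + 5x_2 - 2x_3 &= 1, \\ x_1 - 4x_2 + 10x_3 &= 0 \end{aligned} \tag{11}$$

Solution. Reduce the system (11) to the form (2):

$$\begin{aligned} x_1 &= -1.5 + 0.5x_2 - 0.5x_3, \\ x_2 &= 0.2 - 0.6x_1 + 0.4x_3, \\ x_3 &= -0.1x_1 + 0.4x_2 \end{aligned}$$

Here

$$\alpha = \begin{bmatrix} 0 & 0.5 & -0.5 \\ -0.6 & 0 & 0.4 \\ -0.1 & 0.4 & 0 \end{bmatrix} \quad \text{and} \quad \beta = \begin{bmatrix} -1.5 \\ 0.2 \\ 0 \end{bmatrix}$$

Using formulas (8) and (9), we get

$$\Delta^{(0)} = \beta = \begin{bmatrix} -1.5 \\ 0.2 \\ 0 \end{bmatrix},$$

$$\Delta^{(1)} = \alpha\Delta^{(0)} = \begin{bmatrix} 0 & 0.5 & -0.5 \\ -0.6 & 0 & 0.4 \\ -0.1 & 0.4 & 0 \end{bmatrix} \begin{bmatrix} -1.5 \\ 0.2 \\ 0 \end{bmatrix} = \begin{bmatrix} 0.1 \\ 0.9 \\ 0.23 \end{bmatrix},$$

$$\Delta^{(2)} = \alpha\Delta^{(1)} = \begin{bmatrix} 0 & 0.5 & -0.5 \\ -0.6 & 0 & 0.4 \\ -0.1 & 0.4 & 0 \end{bmatrix} \begin{bmatrix} 0.1 \\ 0.9 \\ 0.23 \end{bmatrix} = \begin{bmatrix} 0.335 \\ 0.032 \\ 0.350 \end{bmatrix}$$

and so forth. The results are entered in Table 21.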




TABLE 21 
SOLVING A LINEAR SYSTEM 
BY THE MODIFIED ITERATION METHOD 
(METHOD OF ACCUMULATION) 






Thus the approximate values of the roots are 
$$x_1 = -1.235, \quad x_2 = 1.089, \quad x_3 = 0.560$$


A defect of this version of the method of iteration is the syste- 
matic accumulation of errors with increasing number of terms, and, 
as a result, considerable errors in the required roots. What is 
more, an error committed in the computations affects the final 
result. For this reason, the first version of the method of iteration
is more reliable. 
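
The accumulation version, case (1) with $\Delta^{(0)} = x^{(0)} = \beta$, can be sketched as follows. The function name `accumulate` and the choice to return the whole list of differences are assumptions made here for illustration; the matrix and vector are those of Example 2.

```python
def accumulate(alpha, beta, terms):
    """Sum Delta^(0) + ... + Delta^(terms), where Delta^(0) = beta
    and Delta^(k+1) = alpha Delta^(k) as in formula (8)."""
    n = len(beta)
    delta = list(beta)
    x = list(beta)
    history = [list(delta)]
    for _ in range(terms):
        delta = [sum(alpha[i][j] * delta[j] for j in range(n)) for i in range(n)]
        x = [xi + di for xi, di in zip(x, delta)]
        history.append(list(delta))
    return x, history


alpha = [[0.0, 0.5, -0.5], [-0.6, 0.0, 0.4], [-0.1, 0.4, 0.0]]
beta = [-1.5, 0.2, 0.0]
x, history = accumulate(alpha, beta, 2)
print(history[1], history[2])   # the differences Delta^(1) and Delta^(2)
```

Because each new term is built only from the previous difference, a slip in one $\Delta^{(k)}$ is carried into every later term, which is the defect noted above.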

Remarks concerning computational accuracy. If all the coefficients 
and constant terms of the given system are exact numbers, the 
solution by means of the method of successive approximations can 
be obtained to any preassigned number m of correct decimal places. 
In this case, retain $m + 1$ decimal places in the values of the successive approximations and compute the successive approximations
until they coincide. Then round off one digit. If the coefficients
and constant terms of the given system are approximate numbers, 
written to p digits, the solution of the system is carried to m= p 
digits, as in the case of exact numbers. 

We give without proof a sufficient condition for the convergence 
of the process of iteration (for the proof see Sec. 9.1). 




Theorem. If for the reduced system (2) at least one of the following two conditions is valid:

(1) $\displaystyle\sum_{j=1}^{n} |\alpha_{ij}| < 1 \quad (i = 1, 2, \ldots, n)$

or

(2) $\displaystyle\sum_{i=1}^{n} |\alpha_{ij}| < 1 \quad (j = 1, 2, \ldots, n)$

then the process of iteration (3) converges to a unique solution of the system irrespective of the choice of the initial approximation.

Corollary. For the system

$$\sum_{j=1}^{n} a_{ij}x_j = b_i \quad (i = 1, 2, \ldots, n)$$

the method of iteration converges if the inequalities

$$|a_{ii}| > \sum_{j \ne i} |a_{ij}| \quad (i = 1, 2, \ldots, n)$$

hold true, that is, if the moduli of the diagonal coefficients are greater for each equation of the system than the sum of the moduli of all the remaining coefficients (disregarding the constant
terms). 
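
The two conditions of the theorem and the test of the corollary are easy to check mechanically. The sketch below uses hypothetical helper names (`max_row_sum`, `max_column_sum`, `is_row_dominant`) chosen for illustration; the matrices are those of Examples 1 and 2 of Sec. 8.10.

```python
def max_row_sum(alpha):
    """Largest sum of moduli along a row; condition (1) requires it to be < 1."""
    return max(sum(abs(v) for v in row) for row in alpha)


def max_column_sum(alpha):
    """Largest sum of moduli down a column; condition (2) requires it to be < 1."""
    n = len(alpha[0])
    return max(sum(abs(row[j]) for row in alpha) for j in range(n))


def is_row_dominant(a):
    """Corollary: |a_ii| exceeds the sum of the other moduli in every row."""
    n = len(a)
    return all(abs(a[i][i]) > sum(abs(a[i][j]) for j in range(n) if j != i)
               for i in range(n))


alpha_1 = [[0.0, -0.06, 0.02], [-0.03, 0.0, 0.05], [-0.01, 0.02, 0.0]]
print(max_row_sum(alpha_1), max_column_sum(alpha_1))
print(is_row_dominant([[4.0, 0.24, -0.08], [0.09, 3.0, -0.15], [0.04, -0.08, 4.0]]))
print(is_row_dominant([[2.0, -1.0, 1.0], [3.0, 5.0, -2.0], [1.0, -4.0, 10.0]]))
```

The conditions are sufficient only: the system of Example 2 fails the dominance test in its first row, which does not by itself show that the iteration diverges.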


8.11 REDUCING A LINEAR SYSTEM TO A FORM CONVENIENT 
FOR ITERATION 
The convergence theorem (Sec. 8.10) imposes stringent conditions 
on the coefficients of the given linear system 
Ax=b (1) 
However, if $\det A \ne 0$, then by a linear combination of the equations of the system (1), this system can always be replaced by the equivalent system

$$x = \beta + \alpha x \tag{2}$$

such that the conditions of the convergence theorem are valid.

Indeed, multiply (1) by the matrix $D = A^{-1} - \varepsilon$, where $\varepsilon = [\varepsilon_{ij}]$ is a matrix with numerically small elements. Then we have

$$(A^{-1} - \varepsilon)Ax = Db$$

or

$$x = \beta + \alpha x \tag{3}$$

where $\alpha = \varepsilon A$ and $\beta = Db$. If $|\varepsilon_{ij}|$ are sufficiently small, it is plain that the system (3) satisfies the conditions of the convergence theorem.

Multiplication by the matrix $D$ is equivalent to a set of elementary transformations of the equations of the system. The problem is to arrive at the standard form (3) with a minimum of effort.

A practical procedure is as follows. Extract from the given 
system those equations with coefficients whose moduli are greater 
than the sum of the moduli of the remaining coefficients of the 
equation. Each chosen equation is written in a row of the new
system so that the numerically largest coefficient is a diagonal
coefficient.

The remaining unused equations and the chosen equations of 
the system are made into linearly independent linear combinations 
in such a way that the above principle of forming a new system 
is fulfilled and all free rows are filled. Be sure that each equation 
that was not yet used is involved in at least one linear combina- 
tion forming an equation of the new system. We give an illustra- 
tive example. 


Example. Reduce the system 


$$\begin{aligned} &(A) & 2x_1 + 3x_2 - 4x_3 + x_4 - 3 &= 0, \\ &(B) & x_1 - 2x_2 - 5x_3 + x_4 - 2 &= 0, \\ &(C) & 5x_1 - 3x_2 + x_3 - 4x_4 - 1 &= 0, \\ &(D) & 10x_1 + 2x_2 - x_3 + 2x_4 + 4 &= 0 \end{aligned}$$


to a form suitable for applying the method of iteration. 


Solution. In equation (B) the coefficient of $x_3$ is greater (in modulus) than the sum of the moduli of the other coefficients, and so we can take this equation for the third equation of the new system. The coefficient of $x_1$ in equation (D) is also greater than the sum of the moduli of the remaining coefficients of equation (D), and so this equation can be taken for the first equation of the new system. Thus, the new system looks like this:

$$\begin{aligned} &(I) & 10x_1 + 2x_2 - x_3 + 2x_4 + 4 &= 0, \\ &(II) & \ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots & \\ &(III) & x_1 - 2x_2 - 5x_3 + x_4 - 2 &= 0, \\ &(IV) & \ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots & \end{aligned}$$

Examining this system, we can easily see that in order to obtain equation (II) with maximum-modulus coefficient of $x_2$ it suffices to form the difference (A) $-$ (B):

$$(II) \qquad x_1 + 5x_2 + x_3 + 0x_4 - 1 = 0$$

The new system now includes the equations (A), (B) and (D), and so equation (IV) must include equation (C) of the given system. A trial convinces us that we can take for equation (IV) the linear combination 2(A) $-$ (B) $+$ 2(C) $-$ (D):

$$(IV) \qquad 3x_1 + 0x_2 + 0x_3 - 9x_4 - 10 = 0$$

We thus obtain the transformed system of equations I-IV, which is equivalent to the original system and satisfies the conditions of convergence of the iteration process. Solving this system for the diagonal unknowns, we get the system

$$\begin{aligned} x_1 &= 0x_1 - 0.2x_2 + 0.1x_3 - 0.2x_4 - 0.4, \\ x_2 &= -0.2x_1 + 0x_2 - 0.2x_3 + 0x_4 + 0.2, \\ x_3 &= 0.2x_1 - 0.4x_2 + 0x_3 + 0.2x_4 - 0.4, \\ x_4 &= 0.333x_1 + 0x_2 + 0x_3 + 0x_4 - 1.111 \end{aligned}$$


to which we can apply the method of iteration. 
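
The linear combinations used in this example can be checked mechanically. In the sketch below each equation is stored as the list of its four coefficients followed by its constant term; the helper names `combine` and `dominant_in` are hypothetical, introduced only for this illustration.

```python
def combine(terms):
    """Return the sum of factor * equation over (factor, equation) pairs."""
    length = len(terms[0][1])
    return [sum(f * eq[k] for f, eq in terms) for k in range(length)]


def dominant_in(eq, i):
    """True if |coefficient i| exceeds the sum of the moduli of the other coefficients."""
    coeffs = eq[:-1]                      # drop the constant term
    return abs(coeffs[i]) > sum(abs(c) for k, c in enumerate(coeffs) if k != i)


A = [2, 3, -4, 1, -3]
B = [1, -2, -5, 1, -2]
C = [5, -3, 1, -4, -1]
D = [10, 2, -1, 2, 4]

eq_I = D
eq_II = combine([(1, A), (-1, B)])
eq_III = B
eq_IV = combine([(2, A), (-1, B), (2, C), (-1, D)])
print(eq_II, eq_IV)
print([dominant_in(eq, i) for i, eq in enumerate([eq_I, eq_II, eq_III, eq_IV])])
```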


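8.12 THE SEIDEL METHOD

The Seidel method is a certain modification of the method of iteration. The principal idea behind it is that in computing the $(k+1)$th approximation of the unknown $x_i$, the earlier computed $(k+1)$th approximations of the unknowns $x_1, x_2, \ldots, x_{i-1}$ are taken into account.

Suppose we have a reduced linear system

$$x_i = \beta_i + \sum_{j=1}^{n} \alpha_{ij}x_j \quad (i = 1, 2, \ldots, n)$$

Arbitrarily choose the initial approximations of the roots

$$x_1^{(0)}, x_2^{(0)}, \ldots, x_n^{(0)}$$

attempting of course to have them roughly correspond to the desired unknowns

$$x_1, x_2, \ldots, x_n$$

Now, assuming that the $k$th approximations $x_i^{(k)}$ of the roots are known, we, following Seidel, construct the $(k+1)$th approximations of the roots by the following formulas:

$$\begin{aligned} x_1^{(k+1)} &= \beta_1 + \sum_{j=1}^{n} \alpha_{1j}x_j^{(k)}, \\ x_2^{(k+1)} &= \beta_2 + \alpha_{21}x_1^{(k+1)} + \sum_{j=2}^{n} \alpha_{2j}x_j^{(k)}, \\ &\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots \\ x_i^{(k+1)} &= \beta_i + \sum_{j=1}^{i-1} \alpha_{ij}x_j^{(k+1)} + \sum_{j=i}^{n} \alpha_{ij}x_j^{(k)}, \\ &\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots \\ x_n^{(k+1)} &= \beta_n + \sum_{j=1}^{n-1} \alpha_{nj}x_j^{(k+1)} + \alpha_{nn}x_n^{(k)} \end{aligned} \qquad (k = 0, 1, 2, \ldots)$$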


Note that the convergence theorem given above (see Sec. 8.10) 
for simple iteration remains valid for iteration by the Seidel method 
(see Secs. 9.3 to 9.7). 

Ordinarily, the Seidel method yields a better convergence than
does the method of simple iteration. What is more, the Seidel
process may converge even when the process of iteration diverges.
This does not always take place however. Cases are possible when 
the Seidel process converges more slowly than does the process of 
iteration. It also sometimes happens that the process of iteration 
converges while the Seidel process diverges [1] (see Sec. 11.6). 
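
One Seidel sweep differs from a step of simple iteration only in that each component is overwritten as soon as it is computed. A minimal sketch follows; the function name `seidel_sweep` is an assumption made for illustration, and the reduced system is the one of the example that follows in this section.

```python
def seidel_sweep(alpha, beta, x):
    """One Seidel sweep for x = beta + alpha x.

    Components j < i already hold their (k+1)th values when x[i] is updated;
    components j >= i still hold their kth values.
    """
    x = list(x)
    n = len(x)
    for i in range(n):
        x[i] = beta[i] + sum(alpha[i][j] * x[j] for j in range(n))
    return x


alpha = [[0.0, -0.1, -0.1], [-0.2, 0.0, -0.1], [-0.2, -0.2, 0.0]]
beta = [1.2, 1.3, 1.4]
x = [1.2, 0.0, 0.0]
for _ in range(2):
    x = seidel_sweep(alpha, beta, x)
    print(x)
```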


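Example. Solve the following system of equations by the Seidel method:

$$\begin{aligned} 10x_1 + x_2 + x_3 &= 12, \\ 2x_1 + 10x_2 + x_3 &= 13, \\ 2x_1 + 2x_2 + 10x_3 &= 14 \end{aligned}$$

Solution. Reduce the system to a form convenient for iteration,

$$\begin{aligned} x_1 &= 1.2 - 0.1x_2 - 0.1x_3, \\ x_2 &= 1.3 - 0.2x_1 - 0.1x_3, \\ x_3 &= 1.4 - 0.2x_1 - 0.2x_2 \end{aligned}$$

For the zeroth approximations of the roots take

$$x_1^{(0)} = 1.2, \quad x_2^{(0)} = 0, \quad x_3^{(0)} = 0$$

Applying the Seidel process, we successively obtain

$$\begin{aligned} x_1^{(1)} &= 1.2 - 0.1 \cdot 0 - 0.1 \cdot 0 = 1.2, \\ x_2^{(1)} &= 1.3 - 0.2 \cdot 1.2 - 0.1 \cdot 0 = 1.06, \\ x_3^{(1)} &= 1.4 - 0.2 \cdot 1.2 - 0.2 \cdot 1.06 = 0.948 \end{aligned}$$

$$\begin{aligned} x_1^{(2)} &= 1.2 - 0.1 \cdot 1.06 - 0.1 \cdot 0.948 = 0.9992, \\ x_2^{(2)} &= 1.3 - 0.2 \cdot 0.9992 - 0.1 \cdot 0.948 = 1.00536, \\ x_3^{(2)} &= 1.4 - 0.2 \cdot 0.9992 - 0.2 \cdot 1.00536 = 0.999098, \text{ etc.} \end{aligned}$$

The results are computed correct to four decimal places and are tabulated in Table 22.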




TABLE 22 


FINDING THE ROOTS OF A LINEAR SYSTEM 
BY THE SEIDEL METHOD 








The exact values of the roots are $x_1 = 1$, $x_2 = 1$, $x_3 = 1$.


8.13 THE CASE OF A NORMAL SYSTEM 


Definition 1. An integral homogeneous polynomial of second degree in $n$ variables is called a quadratic form of these variables. In the general case, a quadratic form looks like this:

$$u(x_1, x_2, \ldots, x_n) = a_{11}x_1^2 + a_{22}x_2^2 + \ldots + a_{nn}x_n^2 + 2a_{12}x_1x_2 + 2a_{13}x_1x_3 + \ldots + 2a_{n-1,\,n}x_{n-1}x_n \tag{1}$$

where $a_{ij}$ $(i, j = 1, 2, \ldots, n)$ are constants; for the sake of convenience, the coefficients of $x_ix_j$ $(i \ne j)$ are taken in the even form $2a_{ij}$.

Equating $u$ to the constant $c$, we get the equation of a central quadric surface:

$$u(x_1, x_2, \ldots, x_n) = c$$

in $n$-dimensional space.

If we put

$$a_{ij} = a_{ji} \tag{2}$$

that is $2a_{ij} = a_{ij} + a_{ji}$, then formula (1) may be written compactly as

$$u(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}x_ix_j \tag{1'}$$

The matrix

$$A = [a_{ij}] \tag{3}$$

is called the matrix of the quadratic form (1'). By virtue of Condition (2), matrix $A$ will be symmetric, that is, it will coincide with its transpose. Contrariwise, for any symmetric matrix $A = [a_{ij}]$ it is possible to construct an associated quadratic form (1').




Definition 2. The quadratic form (1) is called positive (negative) definite if it assumes positive (negative) values, vanishing only for

$$x_1 = x_2 = \ldots = x_n = 0$$

If $u(x_1, x_2, \ldots, x_n)$ is a positive definite quadratic form, then the equation

$$u(x_1, x_2, \ldots, x_n) = c \quad (c > 0)$$

is the equation of an ellipsoid. Note that in this case

$$a_{ii} > 0 \quad (i = 1, 2, \ldots, n)$$

since

$$a_{11} = u(1, 0, \ldots, 0) > 0, \qquad a_{22} = u(0, 1, \ldots, 0) > 0, \quad \ldots$$

Definition 3. Let us call a linear system

$$\sum_{j=1}^{n} a_{ij}x_j = b_i \quad (i = 1, 2, \ldots, n) \tag{4}$$

normal if (1) the matrix $A = [a_{ij}]$ of the coefficients is symmetric, that is, $a_{ij} = a_{ji}$; (2) the corresponding quadratic form

$$u = \sum_{i=1}^{n}\sum_{j=1}^{n} a_{ij}x_ix_j$$

is positive definite.

Normal systems are encountered in the solution of many problems, for instance in the method of least squares, when seeking the directions of the principal axes of an ellipsoid, etc.

Reduce the normal system (4), in the ordinary way, to the special form

$$x_i = \sum_{j \ne i} \alpha_{ij}x_j + \beta_i \quad (i = 1, 2, \ldots, n) \tag{4'}$$

where

$$\alpha_{ij} = -\frac{a_{ij}}{a_{ii}} \quad (j \ne i) \quad \text{and} \quad \beta_i = \frac{b_i}{a_{ii}}$$

Theorem 1. If the linear system (4) is normal, then the Seidel process will always converge for the reduced system (4') equivalent to it.

Proof. See Sec. 11.5 and also [2].

How to reduce a linear system to normal form is indicated by 
the following theorem. 

Theorem 2. If both members of the linear system

$$Ax = b \tag{5}$$

with nonsingular matrix $A = [a_{ij}]$ are premultiplied by the transpose $A' = [a_{ji}]$, then the resulting system

$$A'Ax = A'b \tag{6}$$

is normal.

Proof. We first prove that the matrix $A'A$ is symmetric. Indeed, we have

$$(A'A)' = A'A'' = A'A$$

We now prove that the quadratic form associated with the matrix $A'A$ is positive definite. We form the quadratic form with matrix $A'A$:

$$u(x_1, x_2, \ldots, x_n) = \sum_{i=1}^{n}\sum_{j=1}^{n}\sum_{k=1}^{n} a_{ki}a_{kj}x_ix_j$$

Changing the order of summation, we get

$$u = \sum_{k=1}^{n}\sum_{i=1}^{n}\sum_{j=1}^{n} a_{ki}a_{kj}x_ix_j = \sum_{k=1}^{n}\Bigl(\sum_{i=1}^{n} a_{ki}x_i\Bigr)\Bigl(\sum_{j=1}^{n} a_{kj}x_j\Bigr)$$

Since the value of the sum does not depend on the summation index,

$$u = \sum_{k=1}^{n}\Bigl(\sum_{i=1}^{n} a_{ki}x_i\Bigr)^2 \ge 0$$

By hypothesis, $\det A = \det [a_{ij}] \ne 0$. Therefore the homogeneous system

$$\sum_{i=1}^{n} a_{ki}x_i = 0 \quad (k = 1, 2, \ldots, n)$$

has only the trivial solution. Consequently

$$u(x_1, x_2, \ldots, x_n) > 0 \quad \text{for } |x_1| + |x_2| + \ldots + |x_n| \ne 0$$
The proof of the theorem is complete. 
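
As a small illustration of Theorems 1 and 2, the sketch below forms $A'A$ and $A'b$ for a system whose first diagonal coefficient is zero, so that the ordinary reduction is not even defined, and then applies Seidel sweeps to the normal system. The 2x2 system and the helper names `normalize` and `seidel` are hypothetical, chosen only for this example.

```python
def normalize(a, b):
    """Return (A'A, A'b) for a square matrix a and right-hand side b."""
    n = len(a)
    ata = [[sum(a[k][i] * a[k][j] for k in range(n)) for j in range(n)] for i in range(n)]
    atb = [sum(a[k][i] * b[k] for k in range(n)) for i in range(n)]
    return ata, atb


def seidel(a, b, sweeps):
    """Seidel sweeps applied directly to a x = b, starting from the zero vector."""
    n = len(a)
    x = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / a[i][i]
    return x


A = [[0.0, 1.0], [1.0, 1.0]]   # a_11 = 0: cannot be solved for x_1 row by row
b = [1.0, 2.0]
N, c = normalize(A, b)
print(N, c)
print(seidel(N, c, 60))
```

Premultiplying by $A'$ squares the condition number of the system, so in practice the gain in guaranteed convergence may be paid for with slower convergence.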


8.14 THE METHOD OF RELAXATION 
Suppose we have a system of linear equations 
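$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \ldots + a_{1n}x_n &= b_1, \\ a_{21}x_1 + a_{22}x_2 + \ldots + a_{2n}x_n &= b_2, \\ \ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots & \\ a_{n1}x_1 + a_{n2}x_2 + \ldots + a_{nn}x_n &= b_n \end{aligned} \tag{1}$$

We transform this system as follows: transpose the constant terms to the left and divide the first equation by $-a_{11}$, the second by $-a_{22}$, etc. We then obtain a system that is ready for relaxation:

$$\begin{aligned} -x_1 + b_{12}x_2 + \ldots + b_{1n}x_n + c_1 &= 0, \\ b_{21}x_1 - x_2 + \ldots + b_{2n}x_n + c_2 &= 0, \\ \ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots & \\ b_{n1}x_1 + b_{n2}x_2 + \ldots - x_n + c_n &= 0 \end{aligned} \tag{2}$$

where

$$b_{ij} = -\frac{a_{ij}}{a_{ii}} \quad (i \ne j) \quad \text{and} \quad c_i = \frac{b_i}{a_{ii}}$$

Let $x^{(0)} = (x_1^{(0)}, x_2^{(0)}, \ldots, x_n^{(0)})$ be the initial approximation to the solution of the system (2). Substituting these values into system (2), we get the residuals

$$\begin{aligned} R_1^{(0)} &= c_1 - x_1^{(0)} + \sum_{j=2}^{n} b_{1j}x_j^{(0)}, \\ R_2^{(0)} &= c_2 - x_2^{(0)} + \sum_{j \ne 2} b_{2j}x_j^{(0)}, \\ &\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots \\ R_n^{(0)} &= c_n - x_n^{(0)} + \sum_{j=1}^{n-1} b_{nj}x_j^{(0)} \end{aligned} \tag{3}$$

If we give an increment of $\delta x_s^{(0)}$ to one of the unknowns $x_s^{(0)}$, then the corresponding residual $R_s^{(0)}$ will be diminished by the quantity $\delta x_s^{(0)}$, and all the other residuals $R_i^{(0)}$ $(i \ne s)$ will be increased by the quantity $b_{is}\,\delta x_s^{(0)}$. Thus, to make the next residual $R_s^{(1)}$ vanish, it suffices to give the quantity $x_s^{(0)}$ an increment of

$$\delta x_s^{(0)} = R_s^{(0)}$$

and we have

$$R_s^{(1)} = 0$$

and

$$R_i^{(1)} = R_i^{(0)} + b_{is}\,\delta x_s^{(0)} \quad \text{for } i \ne s$$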

The method of relaxation [3], [4] in its simplest form consists 
in reducing the numerically largest residual to zero at each step 
by changing the value of the appropriate component of the approximation.
The process is terminated when all the residuals of the
last transformed system are equal to zero within the required accu- 
racy. We will not consider the question of the convergence of this 
process [4]. 
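
The simplest form of relaxation described above can be sketched as follows. The function name `relax`, the zero starting vector, the tolerance and the step limit are assumptions made for illustration; the data are those of the example in this section, carried here in full floating-point precision rather than to two decimal places.

```python
def relax(a, b, tol=1e-10, max_steps=10000):
    """Relaxation: at each step annihilate the numerically largest residual."""
    n = len(a)
    # Coefficients b_ij = -a_ij / a_ii of the transformed system (2), i != j.
    coef = [[0.0 if i == j else -a[i][j] / a[i][i] for j in range(n)] for i in range(n)]
    x = [0.0] * n
    r = [b[i] / a[i][i] for i in range(n)]    # residuals at x = 0 equal c_i
    increments = []
    for _ in range(max_steps):
        s = max(range(n), key=lambda i: abs(r[i]))
        if abs(r[s]) <= tol:
            return x, increments
        dx = r[s]                             # increment that makes R_s vanish
        x[s] += dx
        for i in range(n):
            if i != s:
                r[i] += coef[i][s] * dx       # other residuals change by b_is * dx
        r[s] = 0.0
        increments.append((s, dx))
    raise RuntimeError("residuals did not fall below tol")


A = [[10.0, -2.0, -2.0], [-1.0, 10.0, -2.0], [-1.0, -1.0, 10.0]]
b = [6.0, 7.0, 8.0]
x, increments = relax(A, b)
print(increments[:2])
print(x)
```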




Example. Solve the following system by the method of relaxation [3]:

$$\begin{aligned} 10x_1 - 2x_2 - 2x_3 &= 6, \\ -x_1 + 10x_2 - 2x_3 &= 7, \\ -x_1 - x_2 + 10x_3 &= 8 \end{aligned} \tag{4}$$

carrying the computations to two decimal places.

Solution. We reduce the system (4) to a form convenient for relaxation:

$$\begin{aligned} -x_1 + 0.2x_2 + 0.2x_3 + 0.6 &= 0, \\ -x_2 + 0.1x_1 + 0.2x_3 + 0.7 &= 0, \\ -x_3 + 0.1x_1 + 0.1x_2 + 0.8 &= 0 \end{aligned}$$

Choosing the zero values

$$x_1^{(0)} = x_2^{(0)} = x_3^{(0)} = 0$$

for the initial approximations of the roots, we get the respective residuals:

$$R_1^{(0)} = 0.60, \quad R_2^{(0)} = 0.70, \quad R_3^{(0)} = 0.80$$

By the general theory, we assume

$$\delta x_3^{(0)} = 0.80$$

whence we get the residuals

$$\begin{aligned} R_1^{(1)} &= R_1^{(0)} + 0.2 \cdot 0.8 = 0.60 + 0.16 = 0.76, \\ R_2^{(1)} &= R_2^{(0)} + 0.2 \cdot 0.8 = 0.70 + 0.16 = 0.86, \\ R_3^{(1)} &= R_3^{(0)} - \delta x_3^{(0)} = 0 \end{aligned}$$

Now we set

$$\delta x_2^{(1)} = 0.86$$

and so on. The results of the computations are given in Table 23.

Summing all the increments $\delta x_i^{(k)}$ $(i = 1, 2, 3;\ k = 0, 1, \ldots)$, we get the values of the roots:

$$\begin{aligned} x_1 &= 0 + 0.93 + 0.07 = 1.00, \\ x_2 &= 0 + 0.86 + 0.13 + 0.01 = 1.00, \\ x_3 &= 0 + 0.80 + 0.18 + 0.02 = 1.00 \end{aligned}$$

Check by substituting the values of the roots thus found into the original equations; in this case the system (4) has been solved exactly.




TABLE 23 


SOLUTION OF A LINEAR SYSTEM BY THE METHOD OF 
RELAXATION 





8.15 CORRECTING ELEMENTS OF AN APPROXIMATE 
INVERSE MATRIX 


Suppose we have a nonsingular matrix $A$ and it is required to find the inverse $A^{-1}$. Also suppose we have found an approximate value of the inverse matrix $D_0 \approx A^{-1}$. It is then possible to improve the accuracy by using the method of successive approximations in a special form. For a preliminary measure of the error, we use the difference

$$F_0 = E - AD_0$$

If $F_0 = 0$, then plainly $D_0 = A^{-1}$, and so if the moduli of the elements of matrix $F_0$ are small, then matrices $A^{-1}$ and $D_0$ are nearly equal. We will construct successive approximations by the formula

$$D_k = D_{k-1} + D_{k-1}F_{k-1} \quad (k = 1, 2, 3, \ldots) \tag{1}$$

the corresponding error being

$$F_k = E - AD_k$$

Let us estimate the rapidity of convergence of the successive approximations. We have

$$F_1 = E - AD_1 = E - A(D_0 + D_0F_0) = E - AD_0(E + F_0) = E - (E - F_0)(E + F_0) = E - (E - F_0^2) = F_0^2$$

Similarly

$$F_2 = F_1^2 = F_0^4$$

and, generally,

$$F_k = F_0^{2^k} \quad (k = 1, 2, 3, \ldots) \tag{2}$$

We will prove that if

$$\|F_0\| \le q < 1 \tag{3}$$

where $\|F_0\|$ is some canonical norm of the matrix $F_0$ (Sec. 7.7), then the process of iteration (1) converges, that is,

$$\lim_{k \to \infty} D_k = A^{-1}$$

Indeed, from formula (2) we have

$$\|F_k\| \le \|F_0\|^{2^k} \le q^{2^k}$$

And so

$$\lim_{k \to \infty} \|F_k\| = 0$$

and, consequently,

$$\lim_{k \to \infty} F_k = \lim_{k \to \infty} (E - AD_k) = 0$$

or

$$E - A \lim_{k \to \infty} D_k = 0$$

that is

$$\lim_{k \to \infty} D_k = A^{-1}E = A^{-1}$$

Thus, the assertion is proved.

In particular, using the $m$-norm (Sec. 7.7), we find that if the elements of the matrix $F_0 = [f_{ij}]$ satisfy the inequality

$$|f_{ij}| \le \frac{q}{n}$$

where $n$ is the order of the matrix and $0 \le q < 1$, then the process of iteration (1) definitely converges.




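Assuming inequality (3) to be valid, we estimate the error

$$\|R_k\| = \|A^{-1} - D_k\| = \|A^{-1}(E - AD_k)\| \le \|A^{-1}\|\,\|F_k\| \le \|A^{-1}\|\,q^{2^k}$$

Since

$$AD_0 = E - F_0$$

it follows that

$$A^{-1} = D_0(E - F_0)^{-1} = D_0(E + F_0 + F_0^2 + \ldots)$$

whence

$$\|A^{-1}\| \le \|D_0\|\,\{\|E\| + q + q^2 + \ldots\} = \|D_0\|\,\Bigl\{\|E\| + \frac{q}{1 - q}\Bigr\}$$

For the $m$-norm or the $l$-norm we have $\|E\| = 1$, and so

$$\|A^{-1}\| \le \frac{\|D_0\|}{1 - q}$$

Thus

$$\|A^{-1} - D_k\| \le \frac{\|D_0\|}{1 - q}\,\|F_k\| \tag{4}$$

or

$$\|A^{-1} - D_k\| \le \frac{\|D_0\|}{1 - q}\,q^{2^k} \tag{5}$$

where the norm is to be understood in the sense of the $m$-norm or $l$-norm. From formula (4) it follows that the convergence of the process (1) is very rapid for $q \ll 1$.

In practical situations, the process of improving the elements of the inverse matrix is terminated when the inequality

$$\|D_k - D_{k-1}\| \le \varepsilon$$

where $\varepsilon$ is the specified accuracy, is ensured.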
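
Process (1) can be sketched in a few lines. The 2x2 matrix, the rough inverse `D0` and the helper names `matmul`, `residual` and `refine_inverse` are hypothetical, chosen here only so that the quadratic law $F_1 = F_0^2$ of formula (2) can be observed directly; they are not the data of the example that follows.

```python
def matmul(p, q):
    """Product of two square matrices stored as lists of rows."""
    n = len(p)
    return [[sum(p[i][k] * q[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def residual(a, d):
    """F = E - A D."""
    ad = matmul(a, d)
    n = len(a)
    return [[(1.0 if i == j else 0.0) - ad[i][j] for j in range(n)] for i in range(n)]


def refine_inverse(a, d, steps):
    """Apply D_k = D_{k-1} + D_{k-1} F_{k-1} the given number of times."""
    n = len(a)
    for _ in range(steps):
        df = matmul(d, residual(a, d))
        d = [[d[i][j] + df[i][j] for j in range(n)] for i in range(n)]
    return d


A = [[4.0, 1.0], [2.0, 3.0]]        # exact inverse: [[0.3, -0.1], [-0.2, 0.4]]
D0 = [[0.29, -0.1], [-0.2, 0.41]]   # deliberately rough starting inverse
F0 = residual(A, D0)
F1 = residual(A, refine_inverse(A, D0, 1))
print(F0)
print(F1)                           # agrees with F0 squared up to rounding
print(refine_inverse(A, D0, 5))
```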


Example. Correct the elements of the approximate inverse matrix obtained in the example of Sec. 8.7.

Solution. Using the Gaussian method, we obtain for the matrix

$$A = \begin{bmatrix} 1.8 & -3.8 & 0.7 & -3.7 \\ 0.7 & 2.1 & -2.6 & -2.8 \\ 7.3 & 8.1 & 1.7 & -4.9 \\ 1.9 & -4.3 & -4.9 & -4.7 \end{bmatrix}$$

the approximate inverse

$$D_0 = \begin{bmatrix} -0.21121 & -0.46003 & 0.16284 & 0.26956 \\ -0.03533 & 0.16873 & 0.01573 & -0.08920 \\ 0.23030 & 0.04607 & -0.00944 & -0.19885 \\ -0.29316 & -0.38837 & -0.06128 & 0.18513 \end{bmatrix}$$

Here

$$AD_0 = E - 10^{-3} \cdot \begin{bmatrix} 0.03 & 0.00 & 0.01 & 0.00 \\ 0.25 & 0.03 & 0.02 & 0.39 \\ 8.08 & 10.17 & 0.18 & -0.09 \\ 0.00 & 0.00 & 0.00 & -0.48 \end{bmatrix}$$

whence

$$F_0 = E - AD_0 = 10^{-3} \cdot \begin{bmatrix} 0.03 & 0.00 & 0.01 & 0.00 \\ 0.25 & 0.03 & 0.02 & 0.39 \\ 8.08 & 10.17 & 0.18 & -0.09 \\ 0.00 & 0.00 & 0.00 & -0.48 \end{bmatrix}$$

To further improve the elements of the matrix $D_0$ let us use the iteration process

$$D_{k+1} = D_k + D_kF_k, \qquad F_k = E - AD_k \quad (k = 0, 1, 2, \ldots)$$

Since

$$q = \|F_0\|_l = 10^{-3} \cdot (0.03 + 10.17) = 1.02 \cdot 10^{-2} \ll 1$$

the iteration process converges rapidly.

We have

$$D_0F_0 = 10^{-3} \cdot \begin{bmatrix} 1.19 & 1.64 & 0.02 & -0.32 \\ 0.17 & 0.16 & 0.01 & 0.11 \\ -0.06 & -0.09 & 0.00 & 0.11 \\ 0.39 & 0.61 & 0.00 & -0.24 \end{bmatrix}$$

whence

$$D_1 = D_0 + D_0F_0 = \begin{bmatrix} -0.21002 & -0.45839 & 0.16286 & 0.26924 \\ -0.03516 & 0.16889 & 0.01574 & -0.08909 \\ 0.23024 & 0.04598 & -0.00944 & -0.19874 \\ -0.29277 & -0.38776 & -0.06128 & 0.18489 \end{bmatrix}$$

We can take it that

$$A^{-1} \approx D_1$$

since

$$AD_1 = E - 10^{-5} \cdot \begin{bmatrix} 2 & -2 & 1 & 3 \\ 0 & 2 & -1 & 0 \\ \ldots & \ldots & \ldots & \ldots \\ 1 & 0 & 0 & 1 \end{bmatrix}$$

and

$$F_1 = E - AD_1 = 10^{-5} \cdot \begin{bmatrix} 2 & -2 & 1 & 3 \\ 0 & 2 & -1 & 0 \\ \ldots & \ldots & \ldots & \ldots \\ 1 & 0 & 0 & 1 \end{bmatrix}$$

(the third row of this matrix is illegible in the source). For the error we have the estimate, on the basis of formula (4),

$$\|A^{-1} - D_1\|_l \le \frac{\|D_0\|_l}{1 - q}\,\|F_1\|_l$$

Since

$$\|D_0\|_l = 0.46003 + 0.16873 + 0.04607 + 0.38837 < 1.07$$

and

$$\|F_1\|_l = 10^{-5} \cdot (2 + 2 + 4) = 8 \cdot 10^{-5}$$

we finally get

$$\|A^{-1} - D_1\|_l \le \frac{1.07}{0.99} \cdot 8 \cdot 10^{-5} < 9 \cdot 10^{-5}$$

Note. Choosing the approximate inverse matrix can be done in 
a variety of ways. In particular, use is made of the method of 
matrix inversion given in Sec. 7.12.

We conclude this chapter with the remark that many other me- 
thods for solving systems of linear algebraic equations have been 
elaborated (the method of Purcell, the escalator method [6], the 
method of Richardson [7] and others). 




REFERENCES FOR CHAPTER 8 


[1] V. N. Faddeyeva, Computational Methods of Linear Algebra, 1950, Chapter II (in Russian).

[2] James B. Scarborough, Numerical Mathematical Analysis, 1955, Chapter XVIII.

[3] Mario G. Salvadori and M. L. Baron, Numerical Methods in Engineering, 1952, Chapter I, Sec. 10.

[4] Edwin F. Beckenbach (editor), Modern Mathematics for the Engineer, 1956, Chapter 17, What Are Relaxation Methods? by George E. Forsythe.

[5] Kh. L. PA Computational Mathematics (Lecture Notes), 1960 (in Russian).

[6] D. K. Faddeyev and V. N. Faddeyeva, Computational Methods of Linear Algebra, 1960, Chapter II (in Russian).

[7] I. S. Berezin and N. P. Zhidkov, Computational Methods, 1959, Chapter VI (in Russian).



Chapter 9


THE CONVERGENCE OF ITERATION PROCESSES 
FOR SYSTEMS OF LINEAR EQUATIONS 


9.1 SUFFICIENT CONDITIONS FOR THE CONVERGENCE 
OF THE ITERATION PROCESS 


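Suppose we have a reduced linear system:

$$x = \alpha x + \beta \tag{1}$$

where

$$\alpha = [\alpha_{ij}], \qquad \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_n \end{bmatrix}$$

are a given matrix and a given vector and $x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}$ is the desired vector.

Theorem. For the reduced linear system (1) the process of iteration converges to its unique solution if some canonical norm of the matrix $\alpha$ is less than unity; that is, a sufficient condition for convergence of the iteration process

$$x^{(k)} = \beta + \alpha x^{(k-1)} \quad (k = 1, 2, \ldots)$$

($x^{(0)}$ arbitrary) is

$$\|\alpha\| < 1 \tag{2}$$

Proof. Starting with an arbitrary vector $x^{(0)}$, we construct a sequence of approximations

$$\begin{aligned} x^{(1)} &= \beta + \alpha x^{(0)}, \\ x^{(2)} &= \beta + \alpha x^{(1)}, \\ &\ldots\ldots\ldots\ldots \\ x^{(k)} &= \beta + \alpha x^{(k-1)}, \end{aligned}$$

whence

$$x^{(k)} = (E + \alpha + \alpha^2 + \ldots + \alpha^{k-1})\beta + \alpha^k x^{(0)} \tag{3}$$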




Since for la||< 1 we have ||a*|| +0 as k— oo, it follows (see 
Sec. 7.10) that 


lim a* = 0 
k>o 
and 
lim (Eta+o?+t...+ta®)= Yoat=(E—a)- 
Raa k=0 
And so, passing to the limit in (3) as k — oo, we get 


x= lim x) = (E—a)71 (4) 
kaw 


This proves the convergence of the iterative process, Moreover, 
froin (4) we have 

(E—a)x=B 
or 

x=ax +B 


which means that the limiting vector x is a solution of system (1). 
Since the matrix E—qa of system (1) is nonsingular, the solution 
x is unique. 


Corollary 1. The iteration process for system (1) converges if 
(a) on = max & leyl <I 
or 
(0) [lel = max È lon] <1 
or 
L fk: n 
(c) lell = V 2 Dlan <I 


In particular, the iteration process definitely converges if the 
elements of matrix a satisfy the inequality 


1 
jal <z 


where n is the number of unknowns in system (1). 
Indeed, (a), (b} and (c) are the simplest canonical norms of the 
matrix a. 


Corollary 2. For the system

$$\sum_{j=1}^{n} a_{ij} x_j = b_i \qquad (i = 1, 2, \ldots, n) \qquad (5)$$

the process of iteration converges if the following inequalities hold:

(a') $|a_{ii}| > \sum_{j}{}' \, |a_{ij}| \qquad (i = 1, 2, \ldots, n)$

or

(b') $|a_{jj}| > \sum_{i}{}' \, |a_{ij}| \qquad (j = 1, 2, \ldots, n)$

where the prime on the summation symbol means that the values $i = j$ are dropped in the summation; that is, convergence occurs if the moduli of the diagonal elements of the matrix $A = [a_{ij}]$ of system (5) either, for each row, exceed the sum of the moduli of the nondiagonal elements of that row or, for each column, exceed the sum of the moduli of the nondiagonal elements of that column.

Indeed, given inequality (a'), the respective inequality (a) of Corollary 1 will clearly hold for the system reduced to the form $x_i = b_i/a_{ii} - \sum_{j}{}' \, (a_{ij}/a_{ii})\, x_j$.

To prove the second assertion, in (5) put

$$x_i = \frac{z_i}{a_{ii}} \qquad (i = 1, 2, \ldots, n)$$

where $z_i$ are the new unknowns. We then get the system

$$\sum_{j=1}^{n} \frac{a_{ij}}{a_{jj}}\, z_j = b_i \qquad (i = 1, 2, \ldots, n) \qquad (5')$$

for which the iteration process either converges or diverges simultaneously with the process of iteration of the original system (5). Reducing (5') to the special form (1) in the ordinary way, and utilizing Condition (b) of Corollary 1, we get a sufficient condition for the convergence of the process of iteration of the system (5):

$$\sum_{i}{}' \, \left| \frac{a_{ij}}{a_{jj}} \right| < 1 \qquad (j = 1, 2, \ldots, n)$$

or

$$|a_{jj}| > \sum_{i}{}' \, |a_{ij}| \qquad (j = 1, 2, \ldots, n)$$
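Conditions (a') and (b') amount to strict diagonal dominance by rows or by columns. The sketch below, in plain Python, tests both; the function names are illustrative, and the sample matrix is the coefficient matrix of system (3) from the example of Sec. 9.2.

```python
def row_dominant(a):
    """Condition (a'): |a_ii| exceeds the sum of the other moduli in row i."""
    n = len(a)
    return all(abs(a[i][i]) > sum(abs(a[i][j]) for j in range(n) if j != i)
               for i in range(n))


def column_dominant(a):
    """Condition (b'): |a_jj| exceeds the sum of the other moduli in column j."""
    n = len(a)
    return all(abs(a[j][j]) > sum(abs(a[i][j]) for i in range(n) if i != j)
               for j in range(n))


def converges_by_corollary_2(a):
    return row_dominant(a) or column_dominant(a)


A = [[10, -1, 2, -3],
     [1, 10, -1, 2],
     [2, 3, 20, -1],
     [3, 2, 1, 20]]
```

Either condition alone is enough, which is why the two tests are combined with `or`.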


9.2 AN ESTIMATE OF THE ERROR OF APPROXIMATIONS 
IN THE ITERATION PROCESS 


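Let $x^{(k-1)}$ and $x^{(k)}$ ($k \geq 1$) be two successive approximations of the solution of the linear system $x = \alpha x + \beta$. For $p \geq 1$, we have

$$\|x^{(k+p)} - x^{(k)}\| \leq \|x^{(k+1)} - x^{(k)}\| + \|x^{(k+2)} - x^{(k+1)}\| + \ldots + \|x^{(k+p)} - x^{(k+p-1)}\| \qquad (1)$$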


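Since

$$x^{(m+1)} = \alpha x^{(m)} + \beta$$

and

$$x^{(m)} = \alpha x^{(m-1)} + \beta$$

it follows that

$$x^{(m+1)} - x^{(m)} = \alpha\,(x^{(m)} - x^{(m-1)})$$

and, hence,

$$\|x^{(m+1)} - x^{(m)}\| \leq \|\alpha\|\,\|x^{(m)} - x^{(m-1)}\| \leq \ldots \leq \|\alpha\|^{m-k}\,\|x^{(k+1)} - x^{(k)}\| \qquad \text{for } m > k \geq 1$$

Therefore, from formula (1) we get

$$\|x^{(k+p)} - x^{(k)}\| \leq \|x^{(k+1)} - x^{(k)}\| + \|\alpha\|\,\|x^{(k+1)} - x^{(k)}\| + \ldots + \|\alpha\|^{p-1}\,\|x^{(k+1)} - x^{(k)}\| \leq \frac{1}{1 - \|\alpha\|}\,\|x^{(k+1)} - x^{(k)}\|$$

Passing to the limit in the last inequality as $p \to \infty$, we finally obtain

$$\|x - x^{(k)}\| \leq \frac{1}{1 - \|\alpha\|}\,\|x^{(k+1)} - x^{(k)}\| \qquad (2)$$

for $k \geq 1$, or, since $\|x^{(k+1)} - x^{(k)}\| \leq \|\alpha\|\,\|x^{(k)} - x^{(k-1)}\|$,

$$\|x - x^{(k)}\| \leq \frac{\|\alpha\|}{1 - \|\alpha\|}\,\|x^{(k)} - x^{(k-1)}\|$$

If

$$\|\alpha\| \leq \frac{1}{2}$$

then the preceding formula becomes

$$\|x - x^{(k)}\| \leq \|x^{(k)} - x^{(k-1)}\|$$

Thus, in this case, if

$$\|x^{(k)} - x^{(k-1)}\| \leq \varepsilon$$

then we also have

$$\|x - x^{(k)}\| \leq \varepsilon$$

In the general case, if in the process of computations we find that

$$\|x^{(k)} - x^{(k-1)}\| \leq \frac{1 - q}{q}\,\varepsilon$$

where $q = \|\alpha\| < 1$, then

$$\|x - x^{(k)}\| \leq \varepsilon$$

and thus

$$|x_i - x_i^{(k)}| \leq \varepsilon \qquad (i = 1, 2, \ldots, n)$$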

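It is of course assumed here that the successive approximations $x^{(j)}$ ($j = 0, 1, \ldots, k$) are computed exactly, which is to say that rounding errors are completely absent.

Utilizing the above obtained estimates for the norm of the difference of two successive approximations, we get, from formula (2),

$$\|x - x^{(k)}\| \leq \frac{\|\alpha\|^{k}}{1 - \|\alpha\|}\,\|x^{(1)} - x^{(0)}\|$$

In particular, if we choose

$$x^{(0)} = \beta$$

then

$$x^{(1)} = \alpha\beta + \beta$$

and

$$\|x^{(1)} - x^{(0)}\| = \|\alpha\beta\| \leq \|\alpha\|\,\|\beta\|$$

Hence

$$\|x - x^{(k)}\| \leq \frac{\|\alpha\|^{k+1}}{1 - \|\alpha\|}\,\|\beta\| \qquad (2')$$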


Example. Show that the process of iteration converges for the following system:

$$\begin{aligned} 10x_1 - x_2 + 2x_3 - 3x_4 &= 0, \\ x_1 + 10x_2 - x_3 + 2x_4 &= 5, \\ 2x_1 + 3x_2 + 20x_3 - x_4 &= -10, \\ 3x_1 + 2x_2 + x_3 + 20x_4 &= 15 \end{aligned} \qquad (3)$$

How many iterations have to be carried out to find the roots of system (3) to within $10^{-4}$?

Solution. Reducing (3) to the special form, we get

$$\begin{aligned} x_1 &= 0.1x_2 - 0.2x_3 + 0.3x_4, \\ x_2 &= -0.1x_1 + 0.1x_3 - 0.2x_4 + 0.5, \\ x_3 &= -0.1x_1 - 0.15x_2 + 0.05x_4 - 0.5, \\ x_4 &= -0.15x_1 - 0.1x_2 - 0.05x_3 + 0.75 \end{aligned} \qquad (3')$$

Then the matrix of the system is

$$\alpha = \begin{bmatrix} 0 & 0.1 & -0.2 & 0.3 \\ -0.1 & 0 & 0.1 & -0.2 \\ -0.1 & -0.15 & 0 & 0.05 \\ -0.15 & -0.1 & -0.05 & 0 \end{bmatrix}$$

Using, say, the norm $\|\alpha\|_l$, we get

$$\|\alpha\|_l = \max(0.35,\ 0.35,\ 0.35,\ 0.55) = 0.55 < 1$$

Hence the process of iteration of the system (3') converges.


For the initial approximation of the root $x$ we take

$$x^{(0)} = \beta = \begin{bmatrix} 0 \\ 0.5 \\ -0.5 \\ 0.75 \end{bmatrix}$$

whence

$$\|\beta\|_l = 0 + 0.5 + 0.5 + 0.75 = 1.75$$

Let $k$ be the number of iterations required to achieve the specified accuracy. Using formula (2'), we have

$$\|x - x^{(k)}\|_l \leq \frac{\|\alpha\|_l^{k+1}}{1 - \|\alpha\|_l}\,\|\beta\|_l = \frac{0.55^{k+1} \cdot 1.75}{0.45} < 10^{-4}$$

From this,

$$0.55^{k+1} < \frac{45}{175} \cdot 10^{-4}$$

and

$$(k+1) \log_{10} 0.55 < \log_{10} 45 - \log_{10} 175 - 4$$

or

$$-(k+1) \cdot 0.25964 < 1.65321 - 2.24304 - 4 = -4.58983$$

Consequently

$$k + 1 > \frac{4.58983}{0.25964} \approx 17.7$$

and

$$k > 16.7$$

We can take $k = 17$.
It should be pointed out that the theoretical estimate of the number of iterations necessary to ensure the specified accuracy turns out to be excessively high.
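The count obtained in the example can be reproduced directly from formula (2'). A minimal sketch in plain Python, assuming $x^{(0)} = \beta$ as in the example; the function name `iterations_needed` is illustrative.

```python
def iterations_needed(norm_alpha, norm_beta, eps):
    """Smallest k with norm_alpha**(k+1) * norm_beta / (1 - norm_alpha) < eps.

    This is the a priori estimate (2') for the start x0 = beta;
    it requires norm_alpha < 1.
    """
    k = 0
    while norm_alpha ** (k + 1) * norm_beta / (1.0 - norm_alpha) >= eps:
        k += 1
    return k


# Data of the example: l-norm of alpha is 0.55, l-norm of beta is 1.75.
k = iterations_needed(0.55, 1.75, 1e-4)
```

The loop avoids logarithm tables altogether; the result agrees with the value $k = 17$ found above.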


9.3 FIRST SUFFICIENT CONDITION FOR CONVERGENCE 
OF THE SEIDEL PROCESS 
Theorem. If for a linear system

$$x = \alpha x + \beta \qquad (1)$$

the condition

$$\|\alpha\|_m < 1 \qquad (2)$$

where

$$\|\alpha\|_m = \max_i \sum_{j=1}^{n} |\alpha_{ij}|$$

is fulfilled, then the Seidel process converges for (1) to its unique solution for any choice of the initial vector $x^{(0)}$.


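Proof. Let $x^{(k)} = \{x_1^{(k)}, \ldots, x_n^{(k)}\}$ be the $k$th approximation in the Seidel process. We then have

$$x_i^{(k)} = \sum_{j=1}^{i-1} \alpha_{ij} x_j^{(k)} + \sum_{j=i}^{n} \alpha_{ij} x_j^{(k-1)} + \beta_i \qquad (i = 1, 2, \ldots, n;\ k = 1, 2, \ldots) \qquad (3)$$

If Condition (2) is fulfilled, the system (1) admits the unique solution $x = \{x_1, \ldots, x_n\}$, which may be found, say, by the method of iteration. We have

$$x_i = \sum_{j=1}^{i-1} \alpha_{ij} x_j + \sum_{j=i}^{n} \alpha_{ij} x_j + \beta_i \qquad (i = 1, 2, \ldots, n) \qquad (4)$$

Subtracting (3) from (4) we get

$$x_i - x_i^{(k)} = \sum_{j=1}^{i-1} \alpha_{ij} (x_j - x_j^{(k)}) + \sum_{j=i}^{n} \alpha_{ij} (x_j - x_j^{(k-1)})$$

whence

$$|x_i - x_i^{(k)}| \leq \sum_{j=1}^{i-1} |\alpha_{ij}|\,|x_j - x_j^{(k)}| + \sum_{j=i}^{n} |\alpha_{ij}|\,|x_j - x_j^{(k-1)}| \qquad (i = 1, 2, \ldots, n) \qquad (5)$$

According to the meaning of the accepted norm,

$$\|x - x^{(k)}\|_m = \max_i |x_i - x_i^{(k)}|$$

and so

$$|x_j - x_j^{(k)}| \leq \|x - x^{(k)}\|_m \qquad (j = 1, 2, \ldots, n)$$

Hence, from inequality (5) we derive

$$|x_i - x_i^{(k)}| \leq p_i\,\|x - x^{(k)}\|_m + q_i\,\|x - x^{(k-1)}\|_m \qquad (6)$$

where

$$p_i = \sum_{j=1}^{i-1} |\alpha_{ij}| \quad \text{and} \quad q_i = \sum_{j=i}^{n} |\alpha_{ij}|$$

Let $s = s(k)$ be the value of the index $i$ for which

$$|x_s - x_s^{(k)}| = \max_i |x_i - x_i^{(k)}| = \|x - x^{(k)}\|_m$$

Assuming $i = s$ in inequality (6), we get

$$\|x - x^{(k)}\|_m \leq p_s\,\|x - x^{(k)}\|_m + q_s\,\|x - x^{(k-1)}\|_m$$

or

$$\|x - x^{(k)}\|_m \leq \frac{q_s}{1 - p_s}\,\|x - x^{(k-1)}\|_m$$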





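whence

$$\|x - x^{(k)}\|_m \leq \mu\,\|x - x^{(k-1)}\|_m \qquad (7)$$

where

$$\mu = \max_i \frac{q_i}{1 - p_i} \qquad (8)$$

We will show that

$$\mu \leq \|\alpha\|_m < 1$$

Indeed, since

$$p_i + q_i = \sum_{j=1}^{n} |\alpha_{ij}| \leq \|\alpha\|_m < 1$$

it follows that

$$q_i \leq \|\alpha\|_m - p_i$$

and hence

$$\frac{q_i}{1 - p_i} \leq \frac{\|\alpha\|_m - p_i}{1 - p_i} \leq \frac{\|\alpha\|_m - p_i\,\|\alpha\|_m}{1 - p_i} = \|\alpha\|_m$$

Therefore

$$\mu \leq \|\alpha\|_m < 1$$

From inequality (7) it follows that

$$\|x - x^{(k)}\|_m \leq \mu^k\,\|x - x^{(0)}\|_m$$

and consequently

$$\lim_{k \to \infty} x^{(k)} = x$$

This completes the proof that the Seidel process converges to the required solution.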


Note. Since for the method of iteration we have

$$\|x - x^{(k)}\|_m \leq \|\alpha\|_m\,\|x - x^{(k-1)}\|_m$$

and for the Seidel method we obtain

$$\|x - x^{(k)}\|_m \leq \mu\,\|x - x^{(k-1)}\|_m$$

where $\mu \leq \|\alpha\|_m$, it follows that under the conditions of the theorem the convergence of the Seidel process is in general somewhat better than the convergence of the process of simple iteration. From formula (8) it follows that when we use the Seidel method, it is convenient to arrange the system (1) so that the first equation of the system has the smallest sum of the moduli of the coefficients:

$$q_1 = \sum_{j=1}^{n} |\alpha_{1j}|$$
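The Seidel process of formula (3) and the constant $\mu$ of formula (8) can be sketched as follows in plain Python. The function names are illustrative; the data are the matrix $\alpha$ and vector $\beta$ of system (3') from the example of Sec. 9.2, and the fixed step count is an assumption chosen for the demonstration.

```python
def seidel(alpha, beta, x0=None, steps=60):
    """Seidel process: each new component is used as soon as it is computed."""
    n = len(beta)
    x = list(x0) if x0 is not None else list(beta)
    for _ in range(steps):
        for i in range(n):
            # x[j] with j < i already holds step-k values,
            # x[j] with j >= i still holds step-(k-1) values
            x[i] = beta[i] + sum(alpha[i][j] * x[j] for j in range(n))
    return x


def mu(alpha):
    """Constant (8): max over i of q_i / (1 - p_i); assumes every p_i < 1."""
    n = len(alpha)
    ratios = []
    for i in range(n):
        p = sum(abs(alpha[i][j]) for j in range(i))
        q = sum(abs(alpha[i][j]) for j in range(i, n))
        ratios.append(q / (1.0 - p))
    return max(ratios)


alpha = [[0.0, 0.1, -0.2, 0.3],
         [-0.1, 0.0, 0.1, -0.2],
         [-0.1, -0.15, 0.0, 0.05],
         [-0.15, -0.1, -0.05, 0.0]]
beta = [0.0, 0.5, -0.5, 0.75]
x = seidel(alpha, beta)
```

For this matrix the largest ratio comes from the first row, where $p_1 = 0$, which is the reason the Note recommends putting the equation with the smallest sum of moduli first.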


9.4 ESTIMATING THE ERROR OF APPROXIMATIONS 
IN THE SEIDEL PROCESS BY THE m-NORM 


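Let $x^{(k)}$ and $x^{(k+1)}$ be two successive iterations in the Seidel process. Applying to these iterations the transformations utilized in the proof of the theorem of Sec. 9.3, we get an inequality similar to the inequality (7) of Sec. 9.3:

$$\|x^{(k+1)} - x^{(k)}\|_m \leq \mu\,\|x^{(k)} - x^{(k-1)}\|_m$$

From this

$$\begin{aligned} \|x^{(k+p)} - x^{(k)}\|_m &\leq \|x^{(k+p)} - x^{(k+p-1)}\|_m + \ldots + \|x^{(k+1)} - x^{(k)}\|_m \\ &\leq \mu^p\,\|x^{(k)} - x^{(k-1)}\|_m + \ldots + \mu\,\|x^{(k)} - x^{(k-1)}\|_m \leq \frac{\mu}{1 - \mu}\,\|x^{(k)} - x^{(k-1)}\|_m \end{aligned}$$

As $p \to \infty$ we get

$$\lim_{p \to \infty} x^{(k+p)} = x$$

and hence

$$\|x - x^{(k)}\|_m \leq \frac{\mu}{1 - \mu}\,\|x^{(k)} - x^{(k-1)}\|_m$$

where

$$\mu = \max_i \frac{\sum_{j=i}^{n} |\alpha_{ij}|}{1 - \sum_{j=1}^{i-1} |\alpha_{ij}|}$$

In particular, we derive

$$\|x - x^{(k)}\|_m \leq \frac{\mu^k}{1 - \mu}\,\|x^{(1)} - x^{(0)}\|_m$$

from the inequality obtained, that is,

$$|x_i - x_i^{(k)}| \leq \frac{\mu^k}{1 - \mu}\,\max_j |x_j^{(1)} - x_j^{(0)}| \qquad (i = 1, 2, \ldots, n)$$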
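The estimate in terms of two successive iterations gives a practical stopping rule. The following is a minimal sketch in plain Python; the names `seidel_step`, `mu` and `seidel_with_bound`, the small 2 by 2 system and the tolerance are all assumptions made for illustration.

```python
def seidel_step(alpha, beta, x):
    """One sweep of the Seidel process starting from x."""
    y = list(x)
    for i in range(len(y)):
        y[i] = beta[i] + sum(alpha[i][j] * y[j] for j in range(len(y)))
    return y


def mu(alpha):
    """max_i q_i / (1 - p_i) with p_i, q_i as in Sec. 9.3."""
    n = len(alpha)
    return max(sum(abs(alpha[i][j]) for j in range(i, n)) /
               (1.0 - sum(abs(alpha[i][j]) for j in range(i)))
               for i in range(n))


def seidel_with_bound(alpha, beta, eps):
    """Iterate until mu/(1-mu) * ||x_k - x_{k-1}||_m <= eps; requires mu < 1."""
    m = mu(alpha)
    x_prev = list(beta)
    while True:
        x = seidel_step(alpha, beta, x_prev)
        bound = m / (1.0 - m) * max(abs(a - b) for a, b in zip(x, x_prev))
        if bound <= eps:
            return x, bound
        x_prev = x


# Hypothetical system with exact solution (12/7, 10/7).
alpha = [[0.0, 0.5], [0.25, 0.0]]
beta = [1.0, 1.0]
x, bound = seidel_with_bound(alpha, beta, 1e-6)
```

The returned `bound` is a guaranteed estimate of the error only in exact arithmetic; rounding errors are ignored here, as in the text.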


9.5 SECOND SUFFICIENT CONDITION FOR CONVERGENCE 
OF THE SEIDEL PROCESS 


Theorem. If for a linear system

$$x = \alpha x + \beta \qquad (1)$$

the condition

$$\|\alpha\|_l < 1$$

where

$$\|\alpha\|_l = \max_j \sum_{i=1}^{n} |\alpha_{ij}|$$

is fulfilled, then the Seidel process converges to a unique solution of system (1) for any choice of the initial vector.


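Proof. Suppose

$$x_i^{(k)} = \sum_{j=1}^{i-1} \alpha_{ij} x_j^{(k)} + \sum_{j=i}^{n} \alpha_{ij} x_j^{(k-1)} + \beta_i \qquad (i = 1, 2, \ldots, n;\ k = 1, 2, \ldots) \qquad (2)$$

For the exact solution $x = \{x_1, x_2, \ldots, x_n\}$, which exists and is unique, we have

$$x_i = \sum_{j=1}^{i-1} \alpha_{ij} x_j + \sum_{j=i}^{n} \alpha_{ij} x_j + \beta_i \qquad (3)$$

Subtracting from (3) the corresponding equations (2), we get

$$x_i - x_i^{(k)} = \sum_{j=1}^{i-1} \alpha_{ij} (x_j - x_j^{(k)}) + \sum_{j=i}^{n} \alpha_{ij} (x_j - x_j^{(k-1)})$$

whence

$$|x_i - x_i^{(k)}| \leq \sum_{j=1}^{i-1} |\alpha_{ij}|\,|x_j - x_j^{(k)}| + \sum_{j=i}^{n} |\alpha_{ij}|\,|x_j - x_j^{(k-1)}| \qquad (i = 1, 2, \ldots, n)$$

Summing these inequalities, we get

$$\sum_{i=1}^{n} |x_i - x_i^{(k)}| \leq \sum_{i=1}^{n} \sum_{j=1}^{i-1} |\alpha_{ij}|\,|x_j - x_j^{(k)}| + \sum_{i=1}^{n} \sum_{j=i}^{n} |\alpha_{ij}|\,|x_j - x_j^{(k-1)}|$$

or, changing the order of the summation, we obtain

$$\sum_{i=1}^{n} |x_i - x_i^{(k)}| \leq \sum_{j=1}^{n-1} |x_j - x_j^{(k)}| \sum_{i=j+1}^{n} |\alpha_{ij}| + \sum_{j=1}^{n} |x_j - x_j^{(k-1)}| \sum_{i=1}^{j} |\alpha_{ij}| \qquad (4)$$

Set

$$s_j = \sum_{i=j+1}^{n} |\alpha_{ij}|, \quad t_j = \sum_{i=1}^{j} |\alpha_{ij}| \qquad (j = 1, 2, \ldots, n-1)$$

and

$$s_n = 0, \quad t_n = \sum_{i=1}^{n} |\alpha_{in}|$$

Obviously

$$s_j + t_j = \sum_{i=1}^{n} |\alpha_{ij}| \leq \|\alpha\|_l < 1$$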


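whence

$$s_j < 1$$

Inequality (4) takes the form

$$\sum_{j=1}^{n} |x_j - x_j^{(k)}| \leq \sum_{j=1}^{n} s_j\,|x_j - x_j^{(k)}| + \sum_{j=1}^{n} t_j\,|x_j - x_j^{(k-1)}|$$

or

$$\sum_{j=1}^{n} (1 - s_j)\,|x_j - x_j^{(k)}| \leq \sum_{j=1}^{n} t_j\,|x_j - x_j^{(k-1)}|$$

Since

$$t_j \leq \|\alpha\|_l - s_j \leq \|\alpha\|_l - s_j\,\|\alpha\|_l = \|\alpha\|_l\,(1 - s_j) \qquad (5)$$

we then have

$$\sum_{j=1}^{n} (1 - s_j)\,|x_j - x_j^{(k)}| \leq \|\alpha\|_l \sum_{j=1}^{n} (1 - s_j)\,|x_j - x_j^{(k-1)}| \qquad (6)$$

Whence, applying (6) successively, passing to the limit as $k \to \infty$ and noting that $\|\alpha\|_l < 1$, we get

$$\lim_{k \to \infty} \sum_{j=1}^{n} (1 - s_j)\,|x_j - x_j^{(k)}| = 0$$

Hence

$$\lim_{k \to \infty} x_j^{(k)} = x_j \qquad (j = 1, 2, \ldots, n)$$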


and the proof is complete. 


9.6 ESTIMATING THE ERROR OF APPROXIMATIONS 
IN THE SEIDEL PROCESS BY THE l-NORM


Suppose

$$\delta_{k+1} = \sum_{j=1}^{n} (1 - s_j)\,|x_j^{(k+1)} - x_j^{(k)}| \qquad (k = 0, 1, 2, \ldots)$$

Utilizing transformations similar to those used in the proof of the theorem of the preceding section, we get the following inequality [inequality (6) of Sec. 9.5] for two successive iterations $x^{(k)}$ and $x^{(k+1)}$:

$$\delta_{k+1} \leq \rho\,\delta_k \qquad (1)$$

where, by virtue of inequality (5) of Sec. 9.5,

$$\rho = \max_j \frac{t_j}{1 - s_j} \leq \|\alpha\|_l$$


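whence

$$\delta_{k+p} \leq \rho^p\,\delta_k \qquad (p = 1, 2, \ldots)$$

We then have

$$\sum_{j=1}^{n} (1 - s_j)\,|x_j^{(k+p)} - x_j^{(k)}| \leq \delta_{k+p} + \delta_{k+p-1} + \ldots + \delta_{k+1} \leq \rho^p \delta_k + \rho^{p-1} \delta_k + \ldots + \rho\,\delta_k \leq \frac{\rho}{1 - \rho}\,\delta_k$$

From this, we get, as $p \to \infty$,

$$\sum_{j=1}^{n} (1 - s_j)\,|x_j - x_j^{(k)}| \leq \frac{\rho}{1 - \rho}\,\delta_k$$

or

$$\sum_{j=1}^{n} |x_j - x_j^{(k)}| \leq \frac{\rho}{(1 - s)(1 - \rho)} \sum_{j=1}^{n} |x_j^{(k)} - x_j^{(k-1)}|$$

where

$$s = \max_j s_j = \max_j \sum_{i=j+1}^{n} |\alpha_{ij}|$$

Since it follows from formula (1) that

$$\delta_k \leq \rho^{k-1}\,\delta_1$$

the estimate

$$\sum_{j=1}^{n} |x_j - x_j^{(k)}| \leq \frac{\rho^k}{(1 - s)(1 - \rho)}\,\delta_1 \leq \frac{\rho^k}{(1 - s)(1 - \rho)} \sum_{j=1}^{n} |x_j^{(1)} - x_j^{(0)}|$$

is also valid.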
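The constants $s_j$, $t_j$, $\rho$ and $s$ entering the l-norm estimate are simple column sums. A minimal sketch in plain Python follows; the function names are illustrative, and the matrix is that of system (3') from the example of Sec. 9.2.

```python
def seidel_l_constants(alpha):
    """Return (s, t, rho): s_j sums column j below the diagonal,
    t_j sums column j down to and including the diagonal,
    rho = max_j t_j / (1 - s_j)."""
    n = len(alpha)
    s = [sum(abs(alpha[i][j]) for i in range(j + 1, n)) for j in range(n)]
    t = [sum(abs(alpha[i][j]) for i in range(j + 1)) for j in range(n)]
    rho = max(t[j] / (1.0 - s[j]) for j in range(n))
    return s, t, rho


def l_norm_error_bound(alpha, x_curr, x_prev):
    """Bound rho / ((1 - s)(1 - rho)) * sum_j |x_j^(k) - x_j^(k-1)|."""
    s, _, rho = seidel_l_constants(alpha)
    diff = sum(abs(a - b) for a, b in zip(x_curr, x_prev))
    return rho / ((1.0 - max(s)) * (1.0 - rho)) * diff


alpha = [[0.0, 0.1, -0.2, 0.3],
         [-0.1, 0.0, 0.1, -0.2],
         [-0.1, -0.15, 0.0, 0.05],
         [-0.15, -0.1, -0.05, 0.0]]
s, t, rho = seidel_l_constants(alpha)
```

The bound presupposes $\|\alpha\|_l < 1$, so that every $s_j < 1$ and $\rho < 1$.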


9.7 THIRD SUFFICIENT CONDITION FOR CONVERGENCE 
OF THE SEIDEL PROCESS 


Theorem. If for a linear system

$$x = \alpha x + \beta \qquad (1)$$

the condition

$$\|\alpha\|_k < 1$$

where

$$\|\alpha\|_k = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} |\alpha_{ij}|^2}$$

is fulfilled, then for system (1) the Seidel process converges to its unique solution for any choice of the initial vector.


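Proof. Suppose

$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} \quad \text{and} \quad x^{(p)} = \begin{bmatrix} x_1^{(p)} \\ \vdots \\ x_n^{(p)} \end{bmatrix}$$

are, respectively, the exact solution of system (1) and the $p$th approximation ($p = 0, 1, 2, \ldots$) of the Seidel process for this system. We have

$$x_i = \sum_{j=1}^{i-1} \alpha_{ij} x_j + \sum_{j=i}^{n} \alpha_{ij} x_j + \beta_i$$

and

$$x_i^{(p)} = \sum_{j=1}^{i-1} \alpha_{ij} x_j^{(p)} + \sum_{j=i}^{n} \alpha_{ij} x_j^{(p-1)} + \beta_i$$

($i = 1, 2, \ldots, n$), whence

$$x_i - x_i^{(p)} = \sum_{j=1}^{i-1} \alpha_{ij} (x_j - x_j^{(p)}) + \sum_{j=i}^{n} \alpha_{ij} (x_j - x_j^{(p-1)})$$

and, thus,

$$|x_i - x_i^{(p)}|^2 \leq \left\{ \sum_{j=1}^{i-1} |\alpha_{ij}|\,|x_j - x_j^{(p)}| + \sum_{j=i}^{n} |\alpha_{ij}|\,|x_j - x_j^{(p-1)}| \right\}^2$$

Applying the Cauchy inequality (Sec. 7.7) to the sum of all terms in the braces, we obtain

$$|x_i - x_i^{(p)}|^2 \leq s_i \left\{ \sum_{j=1}^{i-1} |x_j - x_j^{(p)}|^2 + \sum_{j=i}^{n} |x_j - x_j^{(p-1)}|^2 \right\} \qquad (2)$$

where

$$s_i = \sum_{j=1}^{n} |\alpha_{ij}|^2$$

Summing the inequalities (2) from 1 to $n$ with respect to $i$, we get

$$\sum_{i=1}^{n} |x_i - x_i^{(p)}|^2 \leq \sum_{i=1}^{n} \sum_{j=1}^{i-1} s_i\,|x_j - x_j^{(p)}|^2 + \sum_{i=1}^{n} \sum_{j=i}^{n} s_i\,|x_j - x_j^{(p-1)}|^2$$

Changing the summation index in the left member and the order of summation in the right member of this inequality, we obtain

$$\sum_{j=1}^{n} |x_j - x_j^{(p)}|^2 \leq \sum_{j=1}^{n-1} |x_j - x_j^{(p)}|^2 \sum_{i=j+1}^{n} s_i + \sum_{j=1}^{n} |x_j - x_j^{(p-1)}|^2 \sum_{i=1}^{j} s_i \qquad (3)$$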

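Suppose

$$S_j = \sum_{i=j+1}^{n} s_i, \quad T_j = \sum_{i=1}^{j} s_i \qquad (j = 1, 2, \ldots, n-1)$$

and

$$S_n = 0, \quad T_n = \sum_{i=1}^{n} s_i$$

We evidently have

$$S_j + T_j = \sum_{i=1}^{n} s_i = \sum_{i=1}^{n} \sum_{j=1}^{n} |\alpha_{ij}|^2 = \|\alpha\|_k^2 < 1 \qquad (j = 1, 2, \ldots, n) \qquad (4)$$

Using these notations, we can represent inequality (3) as

$$\sum_{j=1}^{n} |x_j - x_j^{(p)}|^2 \leq \sum_{j=1}^{n} S_j\,|x_j - x_j^{(p)}|^2 + \sum_{j=1}^{n} T_j\,|x_j - x_j^{(p-1)}|^2$$

or

$$\sum_{j=1}^{n} (1 - S_j)\,|x_j - x_j^{(p)}|^2 \leq \sum_{j=1}^{n} T_j\,|x_j - x_j^{(p-1)}|^2$$

On the basis of formula (4), we obtain

$$T_j = \|\alpha\|_k^2 - S_j \leq \|\alpha\|_k^2 - \|\alpha\|_k^2\,S_j = \|\alpha\|_k^2\,(1 - S_j)$$

Therefore

$$\sum_{j=1}^{n} (1 - S_j)\,|x_j - x_j^{(p)}|^2 \leq \|\alpha\|_k^2 \sum_{j=1}^{n} (1 - S_j)\,|x_j - x_j^{(p-1)}|^2 \qquad (5)$$

From inequality (5), for $p \geq 1$, we successively derive

$$\sum_{j=1}^{n} (1 - S_j)\,|x_j - x_j^{(p)}|^2 \leq \|\alpha\|_k^{2p} \sum_{j=1}^{n} (1 - S_j)\,|x_j - x_j^{(0)}|^2$$

Since $\|\alpha\|_k < 1$, we then get

$$\lim_{p \to \infty} \sum_{j=1}^{n} (1 - S_j)\,|x_j - x_j^{(p)}|^2 = 0$$

and, consequently, taking into account that $0 \leq S_j < 1$ ($j = 1, 2, \ldots, n$), we obtain

$$\lim_{p \to \infty} x_j^{(p)} = x_j \qquad (j = 1, 2, \ldots, n)$$

which is what we set out to prove.

Note. The error of iterations $x^{(p)}$ ($p = 1, 2, \ldots$) is estimated in the same way as in Sec. 9.6.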
REFERENCES FOR CHAPTER 9 


[1] V. N. Faddeyeva, Computational Methods of Linear Algebra, 1950, Chapter I, Secs. 17 and 19 (in Russian).


Chapter 10


ESSENTIALS OF THE THEORY 
OF LINEAR VECTOR SPACES 


10.1 THE CONCEPT OF A LINEAR VECTOR SPACE 


Definition. An ordered $n$-tuple of numbers $x = (x_1, x_2, \ldots, x_n)$, which, generally speaking, are complex, is called a point or a vector of $n$-dimensional space, while the numbers $x_1, x_2, \ldots, x_n$ are termed the coordinates of the vector $x$ [1], [2], [3]. The following are vectors.

(1) The free vectors in a plane or in three-dimensional space 
are two-dimensional or three-dimensional vectors, respectively, in 
the meaning of the definition given above. 

(2) Any solution of any system of linear equations in $n$ unknowns will be an $n$-dimensional vector.

(3) If we have an $n$ by $m$ matrix ($n$ rows and $m$ columns), the rows are $m$-dimensional vectors, and the columns are $n$-dimensional vectors.


Two vectors $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ are considered equal if and only if their coordinates standing in the same positions coincide, that is, if $x_i = y_i$ for $i = 1, 2, \ldots, n$.

We denote the vector $(0, 0, \ldots, 0)$ by $0$ and call it the zero vector.

The sum of the vectors $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$ is the vector

$$x + y = (x_1 + y_1,\ x_2 + y_2,\ \ldots,\ x_n + y_n)$$

whose coordinates are the sums of the respective coordinates of the vectors being added. Addition of vectors obeys the commutative and associative laws:

(1) $x + y = y + x$,
(2) $(x + y) + z = x + (y + z)$.

The difference between two vectors $x$ and $y$ is defined in similar fashion. A vector $-x$ which satisfies the condition $(-x) + x = 0$ is called the negative of the vector $x$. It can easily be shown that

$$x - y = x + (-y)$$


The product of a vector $x = (x_1, x_2, \ldots, x_n)$ by a scalar $k$ is the vector

$$kx = (kx_1,\ kx_2,\ \ldots,\ kx_n)$$

From this definition follow the properties of the product of a vector by a scalar:

(1) $k(x + y) = kx + ky$,
(2) $(k + l)x = kx + lx$,
(3) $k(lx) = (kl)x$,
(4) $0x = 0$,
(5) $1x = x$,
(6) $(-1)x = -x$,

where $k$ and $l$ are arbitrary scalars and $x$ and $y$ are vectors.
For vectors $x$ and $y$ it is natural to define the linear combination

$$\alpha x + \beta y,$$

where $\alpha$, $\beta$ are scalars, as a vector with coordinates $\alpha x_i + \beta y_i$ ($i = 1, 2, \ldots, n$).

Any set of $n$-dimensional vectors which is closed under the operations of addition of vectors and the multiplication of a vector by a scalar is called a linear vector space. As an example, the set of all $n$-dimensional vectors forms an $n$-dimensional vector space $E_n$.


10.2 THE LINEAR DEPENDENCE OF VECTORS 


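Definition 1. The vectors $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$ of a space $E_n$ are termed linearly dependent if there exist scalars $c_1, c_2, \ldots, c_m$, not all zero, such that

$$c_1 x^{(1)} + c_2 x^{(2)} + \ldots + c_m x^{(m)} = 0 \qquad (1)$$

Assuming $c_m \neq 0$, from equation (1) we have

$$x^{(m)} = \gamma_1 x^{(1)} + \gamma_2 x^{(2)} + \ldots + \gamma_{m-1} x^{(m-1)}$$

where

$$\gamma_j = -\frac{c_j}{c_m} \qquad (j = 1, 2, \ldots, m-1)$$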

Thus, the given vectors are linearly dependent if and only if one of them is a linear combination of the others.

But if (1) is possible in the unique case where $c_1 = c_2 = \ldots = c_m = 0$, then the vectors $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$ are called linearly independent; that is, the vectors are linearly independent if and only if no linear combination of them (not all coefficients of which are zero) is a zero vector. There must clearly not be any zero vector among linearly independent vectors.


Example 1. For the case of a three-dimensional vector space, $E_3$, the linear dependence of two vectors $x$ and $y$ means that they are parallel to some straight line, and the linear dependence of three vectors $x$, $y$ and $z$, that they are parallel to some plane.

Note that if some of the vectors are linearly dependent then 
the whole set of vectors is also linearly dependent. 

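Suppose we have a collection of vectors

$$x^{(j)} = (x_1^{(j)}, x_2^{(j)}, \ldots, x_n^{(j)}) \qquad (j = 1, 2, \ldots, m)$$

Then, in order to determine the constants $c_k$ ($k = 1, 2, \ldots, m$) we get the following system by virtue of equation (1):

$$\begin{aligned} c_1 x_1^{(1)} + c_2 x_1^{(2)} + \ldots + c_m x_1^{(m)} &= 0, \\ c_1 x_2^{(1)} + c_2 x_2^{(2)} + \ldots + c_m x_2^{(m)} &= 0, \\ \ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots\ldots & \\ c_1 x_n^{(1)} + c_2 x_n^{(2)} + \ldots + c_m x_n^{(m)} &= 0 \end{aligned} \qquad (2)$$

If this system has nontrivial solutions, then the given vectors are linearly dependent, otherwise they are linearly independent.
Consider the matrix of the coordinates

$$X = \begin{bmatrix} x_1^{(1)} & x_1^{(2)} & \ldots & x_1^{(m)} \\ x_2^{(1)} & x_2^{(2)} & \ldots & x_2^{(m)} \\ \ldots & \ldots & \ldots & \ldots \\ x_n^{(1)} & x_n^{(2)} & \ldots & x_n^{(m)} \end{bmatrix}$$

Let $r$ be the rank of the matrix. Proof is given in algebra [2] that the system (2) has nontrivial solutions if and only if $r < m$. Hence, the vectors $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$ are linearly dependent if $r < m$ and linearly independent if $r = m$ (the rank $r$ cannot, obviously, be greater than $m$).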

From this it follows that the rank of matrix $X$ gives us the maximum number of linearly independent vectors contained in the given set of vectors.

Thus, if the rank of the matrix $X$ is equal to $r$, then among the column vectors $x^{(j)}$ ($j = 1, 2, \ldots, m$): (1) there will be $r$ linearly independent vectors, and (2) every set of $r + 1$ vectors ($r + 1 \leq m$) of this collection is linearly dependent. The same is true of the row vectors $(x_i^{(1)}, \ldots, x_i^{(m)})$ ($i = 1, 2, \ldots, n$) of matrix $X$.


Example 2. Test for linear dependence the system of vectors

$$\begin{aligned} x^{(1)} &= (1, -1, 1, -1, 1), \\ x^{(2)} &= (1, 0, 2, 0, 1), \\ x^{(3)} &= (1, -5, -1, 2, -1), \\ x^{(4)} &= (3, -6, 2, 1, 1). \end{aligned}$$

Solution. Form the matrix of the coordinates

$$X = \begin{bmatrix} 1 & 1 & 1 & 3 \\ -1 & 0 & -5 & -6 \\ 1 & 2 & -1 & 2 \\ -1 & 0 & 2 & 1 \\ 1 & 1 & -1 & 1 \end{bmatrix}$$

To determine the rank $r$ of $X$ perform some elementary transformations: namely, subtract the sum of the first three columns from the fourth column to get

$$X \sim \begin{bmatrix} 1 & 1 & 1 & 0 \\ -1 & 0 & -5 & 0 \\ 1 & 2 & -1 & 0 \\ -1 & 0 & 2 & 0 \\ 1 & 1 & -1 & 0 \end{bmatrix} \quad {}^{1)}$$

From this we conclude that all determinants of fourth order of the matrix $X$ are zero. It is clear that there are third-order minors of $X$ different from zero. Hence, $r = 3$, and since the rank of the matrix is less than the number of vectors, the vectors $x^{(1)}$, $x^{(2)}$, $x^{(3)}$, $x^{(4)}$ are linearly dependent. This is evident in the given case since

$$x^{(1)} + x^{(2)} + x^{(3)} - x^{(4)} = 0$$
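The rank test of Example 2 can be repeated by Gaussian elimination in exact rational arithmetic. The sketch below is plain Python; the function name `rank` is illustrative, and the vectors are entered as rows, which does not change the rank.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a matrix by forward elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    r = 0  # number of pivots found so far
    for c in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(r + 1, n_rows):
            f = m[i][c] / m[r][c]
            m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == n_rows:
            break
    return r


vectors = [[1, -1, 1, -1, 1],
           [1, 0, 2, 0, 1],
           [1, -5, -1, 2, -1],
           [3, -6, 2, 1, 1]]
linearly_dependent = rank(vectors) < len(vectors)
```

Exact fractions are used so that a zero pivot is recognized reliably; with floating-point numbers a tolerance would be needed instead.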


Theorem 1. The maximum number of linearly independent vectors of an $n$-dimensional space $E_n$ is exactly equal to the dimensionality of that space.


Proof. First of all, the space $E_n$ has a system of $n$ linearly independent vectors. Such, for example, is the set of $n$ unit vectors:

$$\begin{aligned} e_1 &= (1, 0, 0, \ldots, 0), \\ e_2 &= (0, 1, 0, \ldots, 0), \\ &\ldots\ldots\ldots\ldots\ldots \\ e_n &= (0, 0, 0, \ldots, 1) \end{aligned}$$

1) The symbol $\sim$ is used to indicate "similar matrices". See Sec. 10.13.


Thus, if

$$c_1 e_1 + c_2 e_2 + \ldots + c_n e_n = (c_1, c_2, \ldots, c_n) = 0$$

then, obviously, $c_1 = c_2 = \ldots = c_n = 0$.

We will show that if the number of vectors $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$ is greater than $n$ ($m > n$), then they must definitely be linearly dependent. Indeed, the matrix of the coordinates of these vectors has the dimensions $n \times m$ and, consequently, its rank $r \leq \min(n, m) = n < m$, whence it follows that these vectors are linearly dependent.


Definition 2. Any set of $n$ linearly independent vectors of an $n$-dimensional space is termed a basis of that space.


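Theorem 2. Every vector of an $n$-dimensional space $E_n$ can be represented uniquely in the form of a linear combination of the vectors of a basis.


Proof. Let $x \in E_n$ and $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ be a basis of $E_n$. By Theorem 1, the vectors $x, \varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are linearly dependent, that is

$$c_0 x + c_1 \varepsilon_1 + c_2 \varepsilon_2 + \ldots + c_n \varepsilon_n = 0 \qquad (3)$$

where a certain coefficient $c_j \neq 0$ ($0 \leq j \leq n$).
In (3) the coefficient $c_0 \neq 0$, since otherwise we would have

$$c_1 \varepsilon_1 + c_2 \varepsilon_2 + \ldots + c_n \varepsilon_n = 0$$

where $c_j \neq 0$ ($j \geq 1$), which contradicts the linear independence of the vectors $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$. Thus, we can solve (3) for $x$:

$$x = \xi_1 \varepsilon_1 + \xi_2 \varepsilon_2 + \ldots + \xi_n \varepsilon_n \qquad (4)$$

where

$$\xi_1 = -\frac{c_1}{c_0}, \quad \xi_2 = -\frac{c_2}{c_0}, \quad \ldots, \quad \xi_n = -\frac{c_n}{c_0}$$

Thus, any vector $x$ of the space $E_n$ is a linear combination of the basis vectors. The expansion (4) is unique. Indeed, if there is another expansion

$$x = \xi_1' \varepsilon_1 + \xi_2' \varepsilon_2 + \ldots + \xi_n' \varepsilon_n \qquad (4')$$

different from the first, then, subtracting (4') from (4), we get

$$0 = (\xi_1 - \xi_1')\,\varepsilon_1 + (\xi_2 - \xi_2')\,\varepsilon_2 + \ldots + (\xi_n - \xi_n')\,\varepsilon_n \qquad (5)$$

where at least one of the coefficients $\xi_j - \xi_j' \neq 0$. Equation (5) is impossible because the vectors of a basis are linearly independent. Hence, there is only one expansion of the form (4).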


Geometric illustration. For the case of a three-dimensional space, formula (4) is equivalent to a decomposition of the vector $x$ along the directions of three given vectors $\varepsilon_1$, $\varepsilon_2$ and $\varepsilon_3$ in the standard position (Fig. 49).

Fig. 49


Definition 3. If $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ form a basis of $n$-dimensional space and

$$x = \xi_1 \varepsilon_1 + \xi_2 \varepsilon_2 + \ldots + \xi_n \varepsilon_n$$

then the scalars $\xi_1, \xi_2, \ldots, \xi_n$ are called the coordinates of the vector $x$ in the given basis $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$. Note that the coordinates of the vector

$$x = (x_1, x_2, \ldots, x_n)$$

are its coordinates in the basis of unit vectors

$$e_j = (\delta_{1j}, \delta_{2j}, \ldots, \delta_{nj}) \qquad (j = 1, 2, \ldots, n)$$

where $\delta_{ij}$ is the Kronecker delta. We thus have the basic expansion

$$x = x_1 e_1 + x_2 e_2 + \ldots + x_n e_n \qquad (6)$$

The basis of unit vectors $e_j$ ($j = 1, 2, \ldots, n$) will be called the initial basis of the space.


Definition 4. A set $E_k$ of vectors of an $n$-dimensional space $E_n$ is called a linear subspace of $E_n$ if the following conditions are met:

(1) from $x \in E_k$ and $y \in E_k$ follows $x + y \in E_k$,

(2) from $x \in E_k$ follows $\alpha x \in E_k$, where $\alpha$ is any scalar. In particular, $0 \in E_k$.

Consequently, $E_k$ may also be regarded as a vector space. The maximum number $k$ of linearly independent vectors in $E_k$ is called the dimensionality of the subspace.

From Theorem 1 it follows that $k \leq n$. Thus, the space $E_n$ can have as subspaces: $E_1$ of one dimension, $E_2$ of two dimensions, and so forth up to $E_n$ of $n$ dimensions (the space itself). The zero vector $0$ may be regarded as a space of zero dimensions.


Example 3. In ordinary three-dimensional space, $E_3$, the subspace $E_1$ of one dimension is a straight line; the subspace $E_2$ of two dimensions is a plane (Fig. 50).





Fig. 50 


Theorem 3. If $z_1, z_2, \ldots, z_k$ are vectors of $E_n$, then the total set of vectors

$$x = \alpha_1 z_1 + \alpha_2 z_2 + \ldots + \alpha_k z_k \qquad (7)$$

where $\alpha_j$ ($j = 1, 2, \ldots, k$) are arbitrary scalars, is a subspace of $E_n$; and if the vectors $z_1, z_2, \ldots, z_k$ ($k \leq n$) are linearly independent, then the dimensionality of this subspace is equal to $k$.

Conversely, any subspace $E_k$ of space $E_n$ coincides with the set of all linear combinations of linearly independent vectors $z_1, z_2, \ldots, z_k$ of that subspace (basis vectors).


Proof. The validity of the first assertion of the theorem can be verified directly.

Let us prove the second assertion. Let $x \in E_k$ and suppose $x$ is not a linear combination of the basis vectors $z_1, z_2, \ldots, z_k$. Then, obviously, the vectors $x, z_1, z_2, \ldots, z_k$ are linearly independent and, thus, space $E_k$ has $k + 1$ linearly independent vectors. But this cannot be, since, by hypothesis, the maximum number of linearly independent vectors of $E_k$ is $k$.

Hence, for some choice of the scalars $\alpha_1, \alpha_2, \ldots, \alpha_k$ we have

$$x = \alpha_1 z_1 + \alpha_2 z_2 + \ldots + \alpha_k z_k$$

which is what we set out to prove.


Corollary. The collection of vectors $x$ defined by formula (7) is the smallest linear space containing the vectors $z_1, z_2, \ldots, z_k$ (it is called the space generated by, or spanned by, the vectors $z_1, z_2, \ldots, z_k$).


10.3 THE SCALAR PRODUCT OF VECTORS

Suppose we have the vectors

$$x = (x_1, x_2, \ldots, x_n) \quad \text{and} \quad y = (y_1, y_2, \ldots, y_n)$$

in an $n$-dimensional space $E_n$. We assume the coordinates of the vectors to be complex numbers:

$$x_j = \xi_j + i\xi_j', \quad y_j = \eta_j + i\eta_j'$$

where $i^2 = -1$, $j = 1, 2, \ldots, n$.
We introduce the conjugate quantities

$$x_j^* = \xi_j - i\xi_j', \quad y_j^* = \eta_j - i\eta_j'$$

Then we obviously have

$$x_j x_j^* = |x_j|^2 = \xi_j^2 + \xi_j'^2$$

By a scalar product of two vectors we mean a number (scalar)

$$(x, y) = \sum_{j=1}^{n} x_j y_j^* \qquad (1)$$

A scalar product has the following properties.
1. The property of positive definiteness. The scalar product of a vector into itself is a nonnegative scalar equal to zero if and only if the vector is zero. Indeed, from formula (1) we have

$$(x, x) = \sum_{j=1}^{n} x_j x_j^* = \sum_{j=1}^{n} |x_j|^2 \geq 0$$

Clearly, $(0, 0) = 0$. Conversely, if $(x, x) = 0$, then $x_j = 0$ ($j = 1, 2, \ldots, n$) and, hence, $x = 0$.

2. Hermitian symmetry. If two factors are interchanged, the scalar product is replaced by its conjugate. True enough, using the theorems on the conjugate quantity of a sum and of a product,1) we have

$$(y, x) = \sum_{j=1}^{n} y_j x_j^* = \sum_{j=1}^{n} (x_j y_j^*)^* = \Big( \sum_{j=1}^{n} x_j y_j^* \Big)^* = (x, y)^*$$

Hence

$$(y, x) = (x, y)^* \qquad (2)$$

1) Here, we take advantage of the following theorems:

(a) the conjugate of a sum is equal to the sum of the conjugate quantities of the terms;

(b) the conjugate of a product is equal to the product of the conjugate quantities of the factors.


3. A scalar factor in the first position can be taken outside the sign of the scalar product:

$$(\alpha x, y) = \alpha\,(x, y) \qquad (3)$$

The proof of this property follows directly from formula (1).

Corollary. The scalar factor occupying the second position may be taken outside the sign of the scalar product and replaced by its conjugate. We have

$$(x, \alpha y) = (\alpha y, x)^* = [\alpha\,(y, x)]^* = \alpha^* (y, x)^* = \alpha^* (x, y)$$

and so

$$(x, \alpha y) = \alpha^* (x, y)$$

4. Distributivity. If the first or second vector is the sum of two vectors, then the scalar product of this vector is the sum of the corresponding scalar products of the summands of the vector.
Suppose

$$x = x^{(1)} + x^{(2)}$$

where $x^{(k)} = (x_1^{(k)}, x_2^{(k)}, \ldots, x_n^{(k)})$ ($k = 1, 2$).

Proceeding from the definition of a sum of vectors, we get, by formula (1),

$$(x^{(1)} + x^{(2)}, y) = \sum_{j=1}^{n} (x_j^{(1)} + x_j^{(2)})\,y_j^* = \sum_{j=1}^{n} x_j^{(1)} y_j^* + \sum_{j=1}^{n} x_j^{(2)} y_j^* = (x^{(1)}, y) + (x^{(2)}, y)$$

That is,

$$(x^{(1)} + x^{(2)}, y) = (x^{(1)}, y) + (x^{(2)}, y) \qquad (4)$$

Furthermore,

$$(x, y^{(1)} + y^{(2)}) = (y^{(1)} + y^{(2)}, x)^* = (y^{(1)}, x)^* + (y^{(2)}, x)^* = (x, y^{(1)}) + (x, y^{(2)}) \qquad (5)$$

Formulas (4) and (5) may readily be extended to any finite number of vectors, namely:

$$\Big( \sum_{j=1}^{m} x^{(j)},\ \sum_{k=1}^{l} y^{(k)} \Big) = \sum_{j=1}^{m} \sum_{k=1}^{l} (x^{(j)}, y^{(k)})$$

Besides the $n$-dimensional complex space that we introduced, it is useful to consider an $n$-dimensional real space consisting of a set of vectors with real coordinates.

In an $n$-dimensional real space, a scalar product is equal to the sum of the products of the appropriate coordinates of the vectors

$$(x, y) = \sum_{j=1}^{n} x_j y_j \qquad (1')$$


The properties of a scalar product given above are then formulated as:

(1) $(x, x) \geq 0$ and if $(x, x) = 0$, then $x = 0$,

(2) $(x, y) = (y, x)$,

(3) $(\alpha x, y) = (x, \alpha y) = \alpha\,(x, y)$ ($\alpha$ real),

(4) $(x + y, z) = (x, z) + (y, z)$,

$(x, y + z) = (x, y) + (x, z)$.

Using the scalar product, we can define the basic metric concepts in an $n$-dimensional space: length of a vector and the angle between two vectors.

1. Length of a vector. The length of a vector in an $n$-dimensional space is the nonnegative scalar

$$|x| = +\sqrt{(x, x)}$$

This definition is clearly in agreement with the notion of the length of a vector in three-dimensional space.

2. The angle between two vectors. The angle $\varphi$ between two vectors $x$ and $y$ is that angle (between 0° and 180°) for which

$$\cos \varphi = \frac{(x, y)}{|x|\,|y|}$$

For vectors in three-dimensional space, this definition is in agreement with the ordinary expression of the angle between two vectors in terms of the scalar product. We can prove that the following inequality holds true [1]:

$$|(x, y)| \leq |x|\,|y|$$

For this reason, the angle between two vectors in real space is real.

10.4 ORTHOGONAL SYSTEMS OF VECTORS 


Definition 1. Two vectors $x$ and $y$ in $E_n$ are called orthogonal if their scalar product is equal to zero:

$$(x, y) = 0 \qquad (1)$$

If the vectors are nonzero, then orthogonality signifies that the angle between them is equal to $\pi/2$. The zero vector is plainly orthogonal to any vector in the space.

Thus, orthogonality is a generalized property of perpendicularity.

Definition 2. A set of vectors $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$ is termed orthogonal if any two vectors of the set are orthogonal to each other, that is,

$$(x^{(j)}, x^{(k)}) = 0 \quad \text{for } j \neq k$$

It will be noted that if the vector $x^{(1)}$ is orthogonal to the vectors $x^{(2)}, \ldots, x^{(m)}$, then this vector is orthogonal also to any linear combination of these vectors; in other words, the vector $x^{(1)}$ is orthogonal to the space spanned by the vectors $x^{(2)}, \ldots, x^{(m)}$. Indeed, if

$$(x^{(1)}, x^{(k)}) = 0 \quad \text{for } k = 2, \ldots, m$$

then we have

$$\Big( x^{(1)},\ \sum_{k=2}^{m} c_k x^{(k)} \Big) = \sum_{k=2}^{m} c_k^*\,(x^{(1)}, x^{(k)}) = 0$$

where $c_2, \ldots, c_m$ are arbitrary constants.


Theorem. Nonzero pairwise orthogonal vectors $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$ are linearly independent.


Proof. Let

$$c_1 x^{(1)} + c_2 x^{(2)} + \ldots + c_m x^{(m)} = 0 \qquad (2)$$

Form the scalar product of both members of (2) by $x^{(1)}$; we get

$$c_1 (x^{(1)}, x^{(1)}) + c_2 (x^{(2)}, x^{(1)}) + \ldots + c_m (x^{(m)}, x^{(1)}) = 0$$

or, since $(x^{(1)}, x^{(1)}) \neq 0$ and $(x^{(j)}, x^{(1)}) = 0$ for $j \neq 1$, then $c_1 (x^{(1)}, x^{(1)}) = 0$ and $c_1 = 0$. In exactly the same way we prove that $c_2 = 0, \ldots, c_m = 0$.

Hence, the vectors $x^{(1)}, x^{(2)}, \ldots, x^{(m)}$ are linearly independent.


Corollary. An orthogonal system in $n$-dimensional space, $E_n$, has at most $n$ nonzero vectors.


Definition 3. A basis $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ of $E_n$ is termed orthogonal if the basis vectors are pairwise orthogonal; that is,

$$(\varepsilon_j, \varepsilon_k) = 0 \quad \text{for } j \neq k \qquad (j, k = 1, 2, \ldots, n)$$

If, moreover, the vectors $\varepsilon_j$ ($j = 1, 2, \ldots, n$) are unit vectors, then the orthogonal basis is called a normalized orthogonal basis (an orthonormal basis, for short). We then have

$$(\varepsilon_j, \varepsilon_k) = \delta_{jk}$$

where $\delta_{jk}$ is the Kronecker delta.


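It is easy to see that the simplest orthonormal basis of $E_n$ space is the system of unit vectors

$$\begin{aligned} e_1 &= (1, 0, 0, \ldots, 0), \\ e_2 &= (0, 1, 0, \ldots, 0), \\ &\ldots\ldots\ldots\ldots\ldots \\ e_n &= (0, 0, 0, \ldots, 1) \end{aligned}$$

that constitute the initial basis.

An orthogonal basis $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ can always be normalized by dividing each of the vectors $\varepsilon_j$ by its length. The resultant vectors

$$\varepsilon_j' = \frac{\varepsilon_j}{\sqrt{(\varepsilon_j, \varepsilon_j)}} \qquad (j = 1, 2, \ldots, n)$$

form an orthonormal basis.
Let us express the coordinates of the vector $x$ in the orthonormal basis $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$. If

$$x = \xi_1 \varepsilon_1 + \xi_2 \varepsilon_2 + \ldots + \xi_n \varepsilon_n \qquad (3)$$

then, multiplying (3) scalarly on the right by $\varepsilon_j$, we get

$$\xi_j = (x, \varepsilon_j) \qquad (j = 1, 2, \ldots, n) \qquad (4)$$

By analogy with vector algebra, we can say that the coordinates of a vector in an orthonormal basis are equal to the projections of the vector on the corresponding vectors of the basis.

Squaring (3), we get

$$(x, x) = \Big( \sum_{j=1}^{n} \xi_j \varepsilon_j,\ \sum_{k=1}^{n} \xi_k \varepsilon_k \Big) = \sum_{j=1}^{n} \sum_{k=1}^{n} \xi_j \xi_k^*\,(\varepsilon_j, \varepsilon_k) = \sum_{j=1}^{n} \xi_j \xi_j^* = \sum_{j=1}^{n} |\xi_j|^2 \qquad (5)$$

Thus, the square of the length of a vector is equal to the sum of the squares of the moduli of its projections on the orthonormal basis vectors (an analogue of the Pythagorean theorem). In particular, if the space $E_n$ is real, then formula (5) may be written without the modulus sign:

$$(x, x) = \sum_{j=1}^{n} \xi_j^2 \qquad (5')$$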
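Normalization of an orthogonal basis, formula (4) for the coordinates and the Pythagorean relation (5') can be illustrated in a real space as follows. This is a sketch in plain Python; the function names and the sample orthogonal basis of three-dimensional space are assumptions made for the example.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def normalize(basis):
    """Divide every vector of an orthogonal basis by its length."""
    return [[v / math.sqrt(dot(e, e)) for v in e] for e in basis]


def coordinates(x, orthonormal_basis):
    """Formula (4): xi_j = (x, eps_j), valid for an orthonormal basis only."""
    return [dot(x, e) for e in orthonormal_basis]


# Hypothetical orthogonal (not yet normalized) basis of real 3-space.
basis = normalize([[1, 1, 0], [1, -1, 0], [0, 0, 2]])
x = [3, 1, 4]
xi = coordinates(x, basis)

# Rebuild x from its coordinates: x = sum of xi_j * eps_j.
rebuilt = [sum(xi[j] * basis[j][i] for j in range(3)) for i in range(3)]
```

Here the sum of the squares of the coordinates `xi` equals $(x, x) = 26$ up to rounding.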


10.5 TRANSFORMATIONS OF THE COORDINATES 
OF A VECTOR UNDER CHANGES IN THE BASIS 


Suppose $e_1, e_2, \ldots, e_n$ and $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are two bases of one and the same linear space $E_n$. Each vector of the new (second) basis $\varepsilon_j$ has, in the old (first) basis $e_i$, certain coordinates $s_{1j}, s_{2j}, \ldots, s_{nj}$;1) that is,

$$\varepsilon_j = s_{1j} e_1 + s_{2j} e_2 + \ldots + s_{nj} e_n \qquad (j = 1, 2, \ldots, n) \qquad (1)$$

The nonsingular matrix $S = [s_{ij}]$ is called the change-of-basis matrix from the old basis to the new basis. (The determinant $\det S \neq 0$, otherwise the vectors $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ would be linearly dependent.) This matrix is the transpose of the matrix that specifies the transformation of the basis. Let $x$ be a given vector. Denote by $x_i$ the coordinates of this vector in the old basis and by $\xi_j$ its coordinates in the new basis. Obviously,

$$x = \sum_{i=1}^{n} x_i e_i = \sum_{j=1}^{n} \xi_j \varepsilon_j$$

whence, substituting into the second sum the expression (1) for $\varepsilon_j$, we get

$$\sum_{i=1}^{n} x_i e_i = \sum_{j=1}^{n} \xi_j \sum_{i=1}^{n} s_{ij} e_i = \sum_{i=1}^{n} e_i \sum_{j=1}^{n} s_{ij} \xi_j$$

Thus, by virtue of the linear independence of the vectors $e_1, e_2, \ldots, e_n$, we find

$$x_i = \sum_{j=1}^{n} s_{ij} \xi_j \qquad (i = 1, 2, \ldots, n) \qquad (2)$$

If we denote

$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \quad \xi = \begin{bmatrix} \xi_1 \\ \vdots \\ \xi_n \end{bmatrix}$$

(in other words, we consider the vector $x$ in the new coordinates as a transformed vector referred to the old basis), then relation (2) may be rewritten in the following matrix notation:

$$x = S\xi \qquad (3)$$

which is to say, the vector in the old coordinates (basis) is equal to the change-of-basis matrix $S$ (or the transpose of the matrix specifying the new basis) multiplied by the vector in the new coordinates.

1) In designating the coordinates, we put in the first position the number of the old basis vector, and in the second, that of the new basis vector.

From formula (3) we get

$$\xi = S^{-1} x \qquad (4)$$
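Formulas (3) and (4) can be illustrated on a small example. The following sketch in plain Python builds $S$ column by column from the new basis vectors and obtains the new coordinates by solving $S\xi = x$ with exact fractions, rather than forming $S^{-1}$ explicitly; the function names and the sample basis are assumptions made for illustration.

```python
from fractions import Fraction


def mat_vec(S, v):
    """Formula (3): old coordinates x = S * xi."""
    return [sum(S[i][j] * v[j] for j in range(len(v))) for i in range(len(S))]


def solve(S, x):
    """Formula (4): new coordinates xi, found by Gauss-Jordan elimination.
    S is assumed to be nonsingular."""
    n = len(S)
    a = [[Fraction(S[i][j]) for j in range(n)] + [Fraction(x[i])] for i in range(n)]
    for c in range(n):
        p = next(i for i in range(c, n) if a[i][c] != 0)
        a[c], a[p] = a[p], a[c]
        a[c] = [v / a[c][c] for v in a[c]]
        for i in range(n):
            if i != c:
                f = a[i][c]
                a[i] = [u - f * w for u, w in zip(a[i], a[c])]
    return [a[i][n] for i in range(n)]


# Hypothetical new basis: eps_1 = e_1 + e_2, eps_2 = -e_1 + 2 e_2.
# Column j of S holds the old coordinates of eps_j.
S = [[1, -1],
     [1, 2]]
xi = solve(S, [1, 4])
```

In this example the vector with old coordinates $(1, 4)$ equals $2\varepsilon_1 + \varepsilon_2$, so its new coordinates are $(2, 1)$.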


We note an important special case similar to the transformation of rectangular coordinates. Suppose the old basis $e_1, e_2, \ldots, e_n$ and the new basis $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ are real and orthonormal; that is,

$$(e_i, e_j) = \delta_{ij} \qquad (5)$$

and

$$(\varepsilon_i, \varepsilon_j) = \delta_{ij} \qquad (5')$$

where $\delta_{ij}$ is the Kronecker delta.

Then formula (1) implies

$$s_{ij} = (\varepsilon_j, e_i) \qquad (i, j = 1, 2, \ldots, n) \qquad (6)$$

that is, the elements of the change-of-basis matrix $S$ are direction cosines and can be specified by Table 24.


TABLE 24

THE COSINES OF THE ANGLES BETWEEN THE UNIT VECTORS OF TWO BASES

Unit vectors of        Unit vectors of old system
new system             $e_1$         $e_2$         ...    $e_n$

$\varepsilon_1$        $s_{11}$      $s_{21}$      ...    $s_{n1}$
...                    ...           ...           ...    ...
$\varepsilon_n$        $s_{1n}$      $s_{2n}$      ...    $s_{nn}$

Substituting expression (1) into formula (5') we get, by formulas (5),

$$(\varepsilon_j, \varepsilon_k) = \Big( \sum_{i=1}^{n} s_{ij} e_i, \; \sum_{l=1}^{n} s_{lk} e_l \Big) = \sum_{i=1}^{n} s_{ij} s_{ik} = \delta_{jk}$$

Thus, (1) the sums of paired products of the corresponding direction
cosines of different coordinate axes of the new orthonormal system
are zero, and (2) the sum of the squares of the direction cosines for
each new coordinate axis is unity. From this,

$$S'S = E \qquad (7)$$

which means that the change-of-basis matrix from one orthonormal
basis to another is orthogonal (for details concerning orthogonal
matrices see Sec. 10.6).


10.6 ORTHOGONAL MATRICES 


Definition. A real matrix $A$ is called orthogonal if its transpose
$A'$ coincides with the inverse $A^{-1}$; thus

$$A' = A^{-1} \qquad (1)$$

or

$$AA' = A'A = E \qquad (2)$$

An orthogonal matrix has the following properties.

1. The rows (columns) of an orthogonal matrix are orthogonal
in pairs.
Indeed, if $A = [a_{ij}]$, then from (2) we have

$$\sum_{k=1}^{n} a_{ik} a_{jk} = 0 \quad \text{for } i \neq j$$

and

$$\sum_{k=1}^{n} a_{ki} a_{kj} = 0 \quad \text{for } i \neq j$$

2. The sum of the squares of the elements of each row (column)
of an orthogonal matrix is equal to unity.
From (2), for $i = j$, we obtain

$$\sum_{k=1}^{n} a_{ik}^2 = 1 \quad \text{and} \quad \sum_{k=1}^{n} a_{ki}^2 = 1$$

3. The determinant of an orthogonal matrix is equal to $\pm 1$.
Thus, on the basis of (2), we have

$$\det A \cdot \det A' = \det E$$

whence, since $\det A' = \det A$ and $\det E = 1$, it follows that

$$(\det A)^2 = 1$$

and, hence, that

$$\det A = \pm 1$$

4. The transpose and the inverse of an orthogonal matrix are
also orthogonal matrices. This property follows directly from
formulas (1) and (2).
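The properties of an orthogonal matrix can be checked numerically on small cases. The sketch below assumes two hypothetical 2x2 matrices, a rotation and a reflection; the helper names are illustrative only.

```python
import math

def transpose(M):
    return [list(col) for col in zip(*M)]

def mat_mul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def det2(M):
    """Determinant of a 2x2 matrix."""
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def is_orthogonal(A, tol=1e-12):
    """Check A A' = E entry by entry, up to a rounding tolerance."""
    n = len(A)
    P = mat_mul(A, transpose(A))
    return all(abs(P[i][j] - (1.0 if i == j else 0.0)) <= tol
               for i in range(n) for j in range(n))

alpha = 0.3                                   # hypothetical angle
rotation = [[math.cos(alpha), -math.sin(alpha)],
            [math.sin(alpha),  math.cos(alpha)]]
reflection = [[1.0, 0.0],
              [0.0, -1.0]]
```

Both matrices pass `is_orthogonal`; the rotation has determinant $+1$ and the reflection $-1$, which illustrates that both signs in property 3 occur.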




10.7 ORTHOGONALIZATION OF MATRICES 
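Suppose we have a matrix with real elements,

$$A = \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{bmatrix}$$

We consider the columns of $A$ as the vectors

$$a^{(j)} = \begin{bmatrix} a_{1j} \\ a_{2j} \\ \vdots \\ a_{nj} \end{bmatrix} \quad (j = 1, 2, \ldots, n)$$

thus, we can write this matrix in the form

$$A = [a^{(1)}, a^{(2)}, \ldots, a^{(n)}]$$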


Theorem 1. Any nonsingular real matrix $A$ may be represented
in the form of a product of a matrix with orthogonal columns by
an upper triangular matrix:

$$A = RT$$

where $R$ is a matrix with orthogonal columns and $T$ is an upper
triangular matrix with unit diagonal.


Proof. For the sake of simplicity, we carry out the proof for
the case when the order of the matrix is $n = 3$. The reasoning,
however, will be of a general nature. Let

$$A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$$

Write this matrix as

$$A = [a^{(1)}, a^{(2)}, a^{(3)}]$$

where

$$a^{(j)} = \begin{bmatrix} a_{1j} \\ a_{2j} \\ a_{3j} \end{bmatrix} \quad (j = 1, 2, 3)$$

are column vectors.

Since the matrix $A$ is nonsingular, the vectors $a^{(1)}, a^{(2)}, a^{(3)}$ are
linearly independent.

This is so because if these vectors were linearly dependent,
then in $\det A$ one of the columns would be a linear combination of
the other two and, hence, $\det A = 0$, which is impossible.


We seek the matrix $R$ also in the form

$$R = [r^{(1)}, r^{(2)}, r^{(3)}]$$

where $r^{(j)}$ $(j = 1, 2, 3)$ are the required orthogonal columns.
Set

$$r^{(1)} = a^{(1)} \qquad (1)$$

Now decompose the vector $a^{(2)}$ into its components $t_{12} r^{(1)}$ and
$r^{(2)}$, of which the first is directed along the vector $r^{(1)}$, and the
second is perpendicular (orthogonal) to it (Fig. 51); thus

$$a^{(2)} = t_{12} r^{(1)} + r^{(2)} \qquad (2)$$

where

$$(r^{(1)}, r^{(2)}) = 0 \qquad (2')$$


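Similarly, decompose vector $a^{(3)}$ into the three components
$t_{13} r^{(1)}$, $t_{23} r^{(2)}$ and $r^{(3)}$, of which the first two are directed,
respectively, along the vectors $r^{(1)}$ and $r^{(2)}$, and the last is
perpendicular both to the vector $r^{(1)}$ and to the vector $r^{(2)}$
(Fig. 51); thus

$$a^{(3)} = t_{13} r^{(1)} + t_{23} r^{(2)} + r^{(3)} \qquad (3)$$

where

$$(r^{(1)}, r^{(3)}) = 0 \quad \text{and} \quad (r^{(2)}, r^{(3)}) = 0 \qquad (3')$$

Fig. 51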
From the construction it is evident that the vectors $r^{(1)}$, $r^{(2)}$
and $r^{(3)}$ will be mutually perpendicular. From the systems (2) and
(3), we determine the vectors $r^{(2)}$ and $r^{(3)}$ and also the
coefficients $t_{ij}$. Multiplying both members of (2) scalarly by
$r^{(1)} = a^{(1)}$, we obtain, by virtue of the orthogonality condition (2'),

$$(a^{(2)}, r^{(1)}) = t_{12} (r^{(1)}, r^{(1)})$$

where

$$(r^{(1)}, r^{(1)}) \neq 0$$

Hence

$$t_{12} = \frac{(a^{(2)}, r^{(1)})}{(r^{(1)}, r^{(1)})}$$

and

$$r^{(2)} = a^{(2)} - t_{12} r^{(1)}$$

Note that since the matrix $A$ is nonsingular, the vector
$r^{(1)} = a^{(1)} \neq 0$ and therefore $(r^{(1)}, r^{(1)}) \neq 0$. Moreover, $r^{(2)} \neq 0$ since
otherwise the vectors $a^{(1)}$ and $a^{(2)}$ would be linearly dependent.

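Analogously, multiplying both members of (3) scalarly and
successively by $r^{(1)}$ and $r^{(2)}$, we get, by virtue of the
orthogonality conditions (2') and (3'),

$$(a^{(3)}, r^{(1)}) = t_{13} (r^{(1)}, r^{(1)}),$$
$$(a^{(3)}, r^{(2)}) = t_{23} (r^{(2)}, r^{(2)})$$

From this, noting that $(r^{(1)}, r^{(1)}) \neq 0$ and $(r^{(2)}, r^{(2)}) \neq 0$, we get

$$t_{13} = \frac{(a^{(3)}, r^{(1)})}{(r^{(1)}, r^{(1)})}, \qquad t_{23} = \frac{(a^{(3)}, r^{(2)})}{(r^{(2)}, r^{(2)})}$$

and

$$r^{(3)} = a^{(3)} - t_{13} r^{(1)} - t_{23} r^{(2)}$$

It is easy to verify that the thus constructed vectors $r^{(1)}$, $r^{(2)}$
and $r^{(3)}$ are pairwise orthogonal. And so we finally have

$$\begin{aligned} a^{(1)} &= r^{(1)}, \\ a^{(2)} &= t_{12} r^{(1)} + r^{(2)}, \\ a^{(3)} &= t_{13} r^{(1)} + t_{23} r^{(2)} + r^{(3)} \end{aligned} \qquad (4)$$

where

$$t_{ij} = \frac{(a^{(j)}, r^{(i)})}{(r^{(i)}, r^{(i)})} \quad (i < j)$$

and

$$(r^{(i)}, r^{(j)}) = 0 \quad \text{for } i \neq j$$

The system (4) is clearly equivalent to the matrix equation

$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \begin{bmatrix} 1 & t_{12} & t_{13} \\ 0 & 1 & t_{23} \\ 0 & 0 & 1 \end{bmatrix}$$

or

$$A = RT \qquad (5)$$

where $R = [r_{ij}]$ is a matrix with orthogonal columns and $T = [t_{ij}]$
is an upper triangular matrix with unit diagonal.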


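Example. Orthogonalize the columns of the matrix

$$A = \begin{bmatrix} 0 & 1 & 2 \\ 1 & 2 & 0 \\ 2 & 0 & 1 \end{bmatrix}$$

Solution. Set

$$r^{(1)} = a^{(1)} = \begin{bmatrix} 0 \\ 1 \\ 2 \end{bmatrix}$$

Then

$$t_{12} = \frac{(a^{(2)}, r^{(1)})}{(r^{(1)}, r^{(1)})} = \frac{1 \cdot 0 + 2 \cdot 1 + 0 \cdot 2}{0^2 + 1^2 + 2^2} = \frac{2}{5} = 0.4$$

We now find that

$$r^{(2)} = a^{(2)} - t_{12} r^{(1)} = \begin{bmatrix} 1 \\ 2 \\ 0 \end{bmatrix} - 0.4 \begin{bmatrix} 0 \\ 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 1 \\ 1.6 \\ -0.8 \end{bmatrix}$$

To determine $r^{(3)}$, compute $t_{13}$ and $t_{23}$ to get

$$t_{13} = \frac{(a^{(3)}, r^{(1)})}{(r^{(1)}, r^{(1)})} = \frac{2 \cdot 0 + 0 \cdot 1 + 1 \cdot 2}{5} = \frac{2}{5} = 0.4,$$

$$t_{23} = \frac{(a^{(3)}, r^{(2)})}{(r^{(2)}, r^{(2)})} = \frac{2 \cdot 1 + 0 \cdot 1.6 + 1 \cdot (-0.8)}{1^2 + 1.6^2 + 0.8^2} = \frac{1.2}{4.2} \approx 0.3$$

whence

$$r^{(3)} = a^{(3)} - t_{13} r^{(1)} - t_{23} r^{(2)} = \begin{bmatrix} 2 \\ 0 \\ 1 \end{bmatrix} - 0.4 \begin{bmatrix} 0 \\ 1 \\ 2 \end{bmatrix} - 0.3 \begin{bmatrix} 1 \\ 1.6 \\ -0.8 \end{bmatrix} = \begin{bmatrix} 1.7 \\ -0.88 \\ 0.44 \end{bmatrix}$$

Thus

$$A = \begin{bmatrix} 0 & 1 & 1.7 \\ 1 & 1.6 & -0.88 \\ 2 & -0.8 & 0.44 \end{bmatrix} \begin{bmatrix} 1 & 0.4 & 0.4 \\ 0 & 1 & 0.3 \\ 0 & 0 & 1 \end{bmatrix}$$

and the vectors

$$r^{(1)} = \begin{bmatrix} 0 \\ 1 \\ 2 \end{bmatrix}, \quad r^{(2)} = \begin{bmatrix} 1 \\ 1.6 \\ -0.8 \end{bmatrix}, \quad r^{(3)} = \begin{bmatrix} 1.7 \\ -0.88 \\ 0.44 \end{bmatrix}$$

are pairwise orthogonal, up to the error introduced by rounding
$t_{23}$ to 0.3. This can be verified directly.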
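The construction in the proof of Theorem 1 can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the function name `orthogonalize_columns` is hypothetical, matrices are plain lists of rows, and no measures are taken against loss of orthogonality through rounding. It is applied to the matrix of the example above.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonalize_columns(A):
    """Return (R, T) with A = R T, where R has orthogonal columns and
    T is upper triangular with unit diagonal (A is a list of rows)."""
    n = len(A)
    cols = [[A[i][j] for i in range(n)] for j in range(n)]    # a^(j)
    r = []                                                     # r^(j)
    T = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for j in range(n):
        v = cols[j][:]
        for i in range(j):
            # t_ij = (a^(j), r^(i)) / (r^(i), r^(i)) for i < j
            T[i][j] = dot(cols[j], r[i]) / dot(r[i], r[i])
            v = [vk - T[i][j] * rk for vk, rk in zip(v, r[i])]
        if dot(v, v) == 0:
            raise ValueError("columns are linearly dependent")
        r.append(v)
    R = [[r[j][i] for j in range(n)] for i in range(n)]
    return R, T

A = [[0.0, 1.0, 2.0],
     [1.0, 2.0, 0.0],
     [2.0, 0.0, 1.0]]
R, T = orthogonalize_columns(A)
```

With unrounded arithmetic the coefficient $t_{23}$ equals $2/7 \approx 0.2857$ rather than 0.3, so the third column of `R` differs slightly from the hand-rounded vector $(1.7, -0.88, 0.44)$.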

In certain cases it is better to orthogonalize the rows of a mat- 
rix, regarding them as corresponding vectors. 

Suppose A’ is the transpose of A and is reduced to the form 


A’ = RT (6) 


where R is a matrix with orthogonal columns and T is an upper 
triangular matrix with unit diagonal. Taking the transpose of (6), 
we get 


A=T'R' (7) 


where T’ is a lower triangular matrix and R’ is a matrix with 
orthogonal rows. Thus, the above-described device for orthogona- 
lizing the columns of a matrix is also suitable for orthogonaliza- 
tion of rows, and we have the following theorem. 


Theorem 2. Every nonsingular real matrix can be represented in 
the form of a product of a lower triangular matrix with unit diago- 
nal and a matrix with orthogonal rows. 




We give yet another technique for orthogonalizing the rows of 
a matrix, which is sometimes of greater practical utility [5]. Sup- 
pose we have a nonsingular real matrix 


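$$A = \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{bmatrix}$$

From each $i$th row of $A$, beginning with the second, subtract
the first row multiplied by a scalar $\lambda_{i1}$ $(i = 2, \ldots, n)$ dependent
on the number of the row. We then get the transformed matrix

$$A^{(1)} = \begin{bmatrix} a_{11}^{(1)} & a_{12}^{(1)} & \ldots & a_{1n}^{(1)} \\ a_{21}^{(1)} & a_{22}^{(1)} & \ldots & a_{2n}^{(1)} \\ \vdots & \vdots & & \vdots \\ a_{n1}^{(1)} & a_{n2}^{(1)} & \ldots & a_{nn}^{(1)} \end{bmatrix}$$

where $a_{ij}^{(1)} = a_{ij}$ for $i = 1$ and $a_{ij}^{(1)} = a_{ij} - \lambda_{i1} a_{1j}$ for $i \geq 2$.
Choose multipliers $\lambda_{i1}$ such that the first row of matrix $A^{(1)}$ is
orthogonal to all the other rows of the matrix. We have

$$\sum_{j=1}^{n} a_{1j}^{(1)} a_{ij}^{(1)} = \sum_{j=1}^{n} a_{1j} (a_{ij} - \lambda_{i1} a_{1j}) = \sum_{j=1}^{n} a_{1j} a_{ij} - \lambda_{i1} \sum_{j=1}^{n} a_{1j}^2 = 0$$

whence

$$\lambda_{i1} = \frac{\sum_{j=1}^{n} a_{1j} a_{ij}}{\sum_{j=1}^{n} a_{1j}^2} \quad (i = 2, \ldots, n)$$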


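Perform the same operation with matrix $A^{(1)}$, namely, leave
unchanged the first two rows, and from each $i$th row where $i \geq 3$,
subtract the second row of $A^{(1)}$ multiplied by a scalar $\lambda_{i2}$
$(i = 3, \ldots, n)$. The new matrix is then

$$A^{(2)} = \begin{bmatrix} a_{11}^{(2)} & a_{12}^{(2)} & \ldots & a_{1n}^{(2)} \\ a_{21}^{(2)} & a_{22}^{(2)} & \ldots & a_{2n}^{(2)} \\ \vdots & \vdots & & \vdots \\ a_{n1}^{(2)} & a_{n2}^{(2)} & \ldots & a_{nn}^{(2)} \end{bmatrix}$$

where $a_{ij}^{(2)} = a_{ij}^{(1)}$ for $i = 1, 2$ and $a_{ij}^{(2)} = a_{ij}^{(1)} - \lambda_{i2} a_{2j}^{(1)}$ when $i \geq 3$.
Since the first row of $A^{(2)}$ coincides with the first row of $A^{(1)}$
and all the other rows of $A^{(2)}$ are linear combinations of the rows
of $A^{(1)}$ orthogonal to the first row of $A^{(1)}$, the rows of $A^{(2)}$
will also be orthogonal to its first row. We choose the multipliers
$\lambda_{i2}$ so that the rows of $A^{(2)}$, from the third onwards, are
orthogonal to the second row. This yields

$$\sum_{j=1}^{n} a_{2j}^{(2)} a_{ij}^{(2)} = \sum_{j=1}^{n} a_{2j}^{(1)} \big( a_{ij}^{(1)} - \lambda_{i2} a_{2j}^{(1)} \big) = \sum_{j=1}^{n} a_{2j}^{(1)} a_{ij}^{(1)} - \lambda_{i2} \sum_{j=1}^{n} \big[ a_{2j}^{(1)} \big]^2 = 0$$

whence

$$\lambda_{i2} = \frac{\sum_{j=1}^{n} a_{2j}^{(1)} a_{ij}^{(1)}}{\sum_{j=1}^{n} \big[ a_{2j}^{(1)} \big]^2} \quad (i = 3, \ldots, n)$$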


This process is continued till we get the matrix

$$A^{(n-1)} = \begin{bmatrix} a_{11}^{(n-1)} & a_{12}^{(n-1)} & \ldots & a_{1n}^{(n-1)} \\ a_{21}^{(n-1)} & a_{22}^{(n-1)} & \ldots & a_{2n}^{(n-1)} \\ \vdots & \vdots & & \vdots \\ a_{n1}^{(n-1)} & a_{n2}^{(n-1)} & \ldots & a_{nn}^{(n-1)} \end{bmatrix}$$

all the rows of which are orthogonal in pairs:

$$\sum_{j=1}^{n} a_{ij}^{(n-1)} a_{kj}^{(n-1)} = 0 \quad \text{when } k \neq i$$

The matrix $A^{(n-1)} = R$ with orthogonal rows was obtained from
the given matrix $A$ as a result of a chain of elementary
transformations. We therefore have the valid equality

$$R = \Lambda A \qquad (8)$$

where $\Lambda$ is a nonsingular matrix, which in our case is a lower
triangular matrix.

It is easy to restore matrix $\Lambda$ by taking the unit matrix $E$ and
performing all the elementary transformations carried out with
respect to $A$. From formula (8) we finally get

$$A = TR$$

where $T = \Lambda^{-1}$ is a lower triangular matrix.
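The row-by-row procedure just described can be sketched as follows. The function name `orthogonalize_rows` is an assumption made for illustration; the matrix $\Lambda$ of formula (8) is accumulated by applying the same row operations to the unit matrix, as the text suggests, and the test matrix is hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonalize_rows(A):
    """Return (R, Lam) with R = Lam A, where R has pairwise orthogonal
    rows and Lam is lower triangular with unit diagonal."""
    n = len(A)
    R = [row[:] for row in A]
    Lam = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        norm2 = dot(R[k], R[k])
        if norm2 == 0:
            raise ValueError("matrix is singular")
        for i in range(k + 1, n):
            lam = dot(R[k], R[i]) / norm2          # multiplier for row i
            R[i] = [a - lam * b for a, b in zip(R[i], R[k])]
            # the same elementary transformation applied to the unit matrix
            Lam[i] = [a - lam * b for a, b in zip(Lam[i], Lam[k])]
    return R, Lam

A = [[0.0, 1.0, 2.0],
     [1.0, 2.0, 0.0],
     [2.0, 0.0, 1.0]]
R, Lam = orthogonalize_rows(A)
```

Since `Lam` is lower triangular with unit diagonal, the factor $T = \Lambda^{-1}$ in $A = TR$ is lower triangular as well.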
We give some properties of matrices with orthogonal rows. 


Lemma. If the columns of a real matrix constitute an orthogonal
system of vectors, then the product of the transpose of the matrix by
that matrix is equal to a diagonal matrix.


Proof. Let $A = [a_{ij}]$ be a given matrix. It is required to prove
that $A'A = D$, where $A' = [a_{ji}]$ is the transpose of matrix $A$ and

$$D = \begin{bmatrix} d_{11} & 0 & \ldots & 0 \\ 0 & d_{22} & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & d_{nn} \end{bmatrix}$$

is a diagonal matrix.

Assuming $D = [d_{ij}]$ we have, by the rule of matrix multiplication,

$$d_{ij} = \sum_{k=1}^{n} a_{ki} a_{kj}$$

whence, since $a_{ki}$ are the coordinates of the $i$th vector $a^{(i)}$ and $a_{kj}$
are the coordinates of the $j$th vector $a^{(j)}$, we get

$$d_{ij} = \sum_{k=1}^{n} a_{ki} a_{kj} = (a^{(i)}, a^{(j)}) = 0 \quad \text{if } i \neq j$$

Consequently, $D = [d_{ij}]$ is a diagonal matrix.


Corollary. The product of a real matrix with orthogonal rows
by the transpose of that matrix is equal to a diagonal matrix,
that is, $AA' = D$.


Theorem 3. Any nonsingular real matrix $A$ with orthogonal
columns is an orthogonal matrix postmultiplied by a diagonal matrix.


Proof. By the lemma we have

$$A'A = D \qquad (9)$$

where $D = [d_{ij}]$ is a diagonal matrix. If $A = [a_{ij}]$, then obviously

$$d_{ii} = \sum_{k=1}^{n} a_{ki}^2 > 0$$

Put

$$\rho_i = \sqrt{d_{ii}} > 0 \quad (i = 1, 2, \ldots, n)$$

and

$$d = \begin{bmatrix} \rho_1 & 0 & \ldots & 0 \\ 0 & \rho_2 & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & \rho_n \end{bmatrix}$$

It is plain that $D = d^2$. From (9) we have $A'A = d^2$, whence

$$d^{-1} A'A d^{-1} = E$$

Since $(d^{-1})' = d^{-1}$, it follows that $(Ad^{-1})' (Ad^{-1}) = E$. Consequently,
matrix $Ad^{-1} = U$ is orthogonal and, hence,

$$A = Ud \qquad (10)$$

which completes the proof.


Corollary. A nonsingular real matrix with orthogonal rows may
be represented in the form of a product of a diagonal matrix by
an orthogonal matrix.

Indeed, let $A$ be a matrix with orthogonal rows; then $A'$ is a
matrix with orthogonal columns. By formula (10) we have $A' = Ud$,
where $U$ is an orthogonal matrix and $d$ is a diagonal matrix which
may be determined from the relation

$$AA' = d^2$$

From this we get

$$A = (A')' = d'U' = dU'$$

where $U'$ is also an orthogonal matrix.
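Theorem 3 amounts to dividing each column by its length. The following sketch assumes a hypothetical 2x2 matrix with orthogonal columns; the function name `normalize_columns` is illustrative.

```python
import math

def normalize_columns(A):
    """For A with orthogonal nonzero columns return (U, rho) such that
    A = U * diag(rho) and U is an orthogonal matrix."""
    n = len(A)
    # rho_j = sqrt(d_jj): the length of the j-th column
    rho = [math.sqrt(sum(A[k][j] ** 2 for k in range(n))) for j in range(n)]
    U = [[A[i][j] / rho[j] for j in range(n)] for i in range(n)]
    return U, rho

A = [[1.0, -2.0],
     [1.0,  2.0]]        # columns (1, 1) and (-2, 2) are orthogonal
U, rho = normalize_columns(A)
```

For a matrix with orthogonal rows the same function can be applied to the transpose, in line with the corollary.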

Note. In order to transform a given nonsingular real matrix $A$
with orthogonal columns (rows) into an orthogonal matrix, it is
sufficient to normalize the columns (rows), which means that each
element of every column (row) is to be divided by the square root
of the sum of the squares of the elements of that column (row).

For instance, if $A = [a_{ij}]$ is a matrix with orthogonal columns,
then the matrix

$$\tilde{A} = [\tilde{a}_{ij}],$$

where

$$\tilde{a}_{ij} = \frac{a_{ij}}{\sqrt{\sum_{k=1}^{n} a_{kj}^2}} \quad (i, j = 1, 2, \ldots, n)$$

is an orthogonal matrix.


10.8 APPLYING ORTHOGONALIZATION METHODS TO THE
SOLUTION OF SYSTEMS OF LINEAR EQUATIONS


A. FIRST METHOD (ORTHOGONALIZATION OF COLUMNS)

Suppose we have a system of linear equations

$$Ax = b \qquad (1)$$

with a nonsingular real matrix $A$. Orthogonalizing the columns of
$A$ we obtain a matrix $R$; here, $A = RT$, where $T$ is an upper
triangular matrix. We have

$$RTx = b \qquad (2)$$

Premultiplying both members of (2) by $R'$, we get

$$R'RTx = R'b \qquad (3)$$

But, as we know, $R'R = D$, where $D$ is a diagonal matrix.
Introducing the notation $R'b = \beta$, we have

$$DTx = \beta$$

whence

$$x = (DT)^{-1} \beta = T^{-1} D^{-1} \beta \qquad (4)$$

The matrix $D^{-1}$, which is the inverse of the diagonal matrix,
is found without difficulty; namely, if

$$D = \begin{bmatrix} d_{11} & 0 & \ldots & 0 \\ 0 & d_{22} & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & d_{nn} \end{bmatrix}$$

then

$$D^{-1} = \begin{bmatrix} d_{11}^{-1} & 0 & \ldots & 0 \\ 0 & d_{22}^{-1} & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & d_{nn}^{-1} \end{bmatrix}$$

It is also relatively simple to find the inverse $T^{-1}$ of the
triangular matrix $T$.


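Example 1. Using the method of orthogonalization of columns, solve
the following system of equations:

$$\begin{aligned} 0.4x_1 + 0.3x_2 - 0.2x_3 &= 2, \\ 0.6x_1 - 0.5x_2 + 0.3x_3 &= 2.5, \\ 0.3x_1 + 0.2x_2 + 0.5x_3 &= 11 \end{aligned}$$

Solution. Represent the matrix $A$ of this system in the form of
a product of matrix $R$ with orthogonal columns by a triangular
matrix with unit diagonal:

$$A = RT = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \begin{bmatrix} 1 & t_{12} & t_{13} \\ 0 & 1 & t_{23} \\ 0 & 0 & 1 \end{bmatrix}$$

Set

$$r^{(1)} = a^{(1)}, \quad r^{(2)} = a^{(2)} - t_{12} r^{(1)}, \quad r^{(3)} = a^{(3)} - t_{13} r^{(1)} - t_{23} r^{(2)}$$

We have

$$r^{(1)} = \begin{bmatrix} 0.4 \\ 0.6 \\ 0.3 \end{bmatrix}$$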


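Using formulas (4) of the preceding section, we get

$$t_{12} = \frac{(a^{(2)}, r^{(1)})}{(r^{(1)}, r^{(1)})} = \frac{0.12 - 0.30 + 0.06}{0.16 + 0.36 + 0.09} = -\frac{0.12}{0.61} = -0.1967,$$

$$r^{(2)} = \begin{bmatrix} 0.3 \\ -0.5 \\ 0.2 \end{bmatrix} + 0.1967 \begin{bmatrix} 0.4 \\ 0.6 \\ 0.3 \end{bmatrix} = \begin{bmatrix} 0.3787 \\ -0.3820 \\ 0.2590 \end{bmatrix}$$

Check:

$$(r^{(1)}, r^{(2)}) = 0.4 \cdot 0.3787 + 0.6 \cdot (-0.3820) + 0.3 \cdot 0.2590 = 0.1515 - 0.2292 + 0.0777 = 0,$$

$$t_{13} = \frac{(a^{(3)}, r^{(1)})}{(r^{(1)}, r^{(1)})} = \frac{-0.08 + 0.18 + 0.15}{0.61} = \frac{0.25}{0.61} = 0.4098,$$

$$t_{23} = \frac{(a^{(3)}, r^{(2)})}{(r^{(2)}, r^{(2)})} = \frac{-0.07574 - 0.11460 + 0.12950}{0.35} \approx -0.1714,$$

$$r^{(3)} = \begin{bmatrix} -0.2 \\ 0.3 \\ 0.5 \end{bmatrix} - 0.4098 \begin{bmatrix} 0.4 \\ 0.6 \\ 0.3 \end{bmatrix} + 0.1714 \begin{bmatrix} 0.3787 \\ -0.3820 \\ 0.2590 \end{bmatrix} = \begin{bmatrix} -0.2990 \\ -0.0114 \\ 0.4215 \end{bmatrix}$$

Check:

$$(r^{(1)}, r^{(3)}) = (r^{(2)}, r^{(3)}) = 0$$

Thus,

$$A = \begin{bmatrix} 0.4 & 0.3787 & -0.2990 \\ 0.6 & -0.3820 & -0.0114 \\ 0.3 & 0.2590 & 0.4215 \end{bmatrix} \begin{bmatrix} 1 & -0.1967 & 0.4098 \\ 0 & 1 & -0.1714 \\ 0 & 0 & 1 \end{bmatrix}$$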


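From formula (4) we have

$$x = T^{-1} D^{-1} R'b$$

where $D = R'R$ is a diagonal matrix and

$$b = \begin{bmatrix} 2 \\ 2.5 \\ 11 \end{bmatrix}$$

For the matrix $D$ and its inverse $D^{-1}$ we obtain the values

$$D = \begin{bmatrix} 0.61 & 0 & 0 \\ 0 & 0.35 & 0 \\ 0 & 0 & 0.2672 \end{bmatrix} \quad \text{and} \quad D^{-1} = \begin{bmatrix} 1.64 & 0 & 0 \\ 0 & 2.81 & 0 \\ 0 & 0 & 3.75 \end{bmatrix}$$

Furthermore,

$$R'b = \begin{bmatrix} 0.4 & 0.6 & 0.3 \\ 0.3787 & -0.3820 & 0.2590 \\ -0.2990 & -0.0114 & 0.4215 \end{bmatrix} \begin{bmatrix} 2 \\ 2.5 \\ 11 \end{bmatrix} \approx \begin{bmatrix} 5.6 \\ 2.65 \\ 4.01 \end{bmatrix}$$

Finally, we compute (in the usual way)

$$T^{-1} = \begin{bmatrix} 1 & 0.1967 & -0.3761 \\ 0 & 1 & 0.1714 \\ 0 & 0 & 1 \end{bmatrix}$$

The result is thus

$$x = T^{-1} D^{-1} R'b \approx \begin{bmatrix} 5.0238 \\ 10.0475 \\ 15.0087 \end{bmatrix}$$

and, hence,

$$x_1 = 5.0238, \quad x_2 = 10.0475, \quad x_3 = 15.0087$$

The exact values of the roots are $x_1 = 5$, $x_2 = 10$, $x_3 = 15$.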
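The first method can be sketched in Python as follows, using the data of Example 1. This is an illustration under stated assumptions: the function name is hypothetical, and instead of forming $T^{-1}$ explicitly the sketch solves $Tx = D^{-1}R'b$ by back substitution, which gives the same $x$ as formula (4).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_by_column_orthogonalization(A, b):
    """Solve A x = b via A = R T with orthogonal columns in R."""
    n = len(A)
    cols = [[A[i][j] for i in range(n)] for j in range(n)]
    r = []                                   # orthogonal columns r^(j)
    T = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for i in range(j):
            T[i][j] = dot(cols[j], r[i]) / dot(r[i], r[i])
            v = [vk - T[i][j] * rk for vk, rk in zip(v, r[i])]
        T[j][j] = 1.0
        r.append(v)
    # y = D^{-1} R' b: one scalar product per orthogonal column
    y = [dot(r[j], b) / dot(r[j], r[j]) for j in range(n)]
    # back substitution for T x = y (T is unit upper triangular)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = y[i] - sum(T[i][j] * x[j] for j in range(i + 1, n))
    return x

A = [[0.4, 0.3, -0.2],
     [0.6, -0.5, 0.3],
     [0.3, 0.2, 0.5]]
b = [2.0, 2.5, 11.0]
x = solve_by_column_orthogonalization(A, b)
```

In double-precision arithmetic the computed `x` agrees with the exact roots 5, 10, 15 to many decimal places; the deviations in the hand computation come from rounding the intermediate quantities to a few digits.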


B. SECOND METHOD (ORTHOGONALIZATION OF ROWS)


Suppose we have a system

$$Ax = b \qquad (5)$$

where $\det A \neq 0$.

Transform the rows of (5) by the device given in the preceding
section so that the matrix $A$ is transformed to the matrix $R$ with
orthogonal rows. Then the vector $b$ will pass into some vector $\beta$.
As a result, we get the equivalent system

$$Rx = \beta \qquad (6)$$

Hence

$$x = R^{-1} \beta \qquad (7)$$

As we know, $RR' = D = d^2$, where $d$ is a diagonal matrix and
$R = dU$, where $U$ is an orthogonal matrix, and so

$$R^{-1} = (dU)^{-1} = U^{-1} d^{-1} = U'd'd^{-2} = (dU)' d^{-2} = R'd^{-2} = R'D^{-1}$$

Thus, on the basis of formula (7) we finally have

$$x = R'D^{-1} \beta \qquad (8)$$

where

$$D = RR' \qquad (9)$$

By using (8) we can avoid the most laborious process of finding
the inverse of a nondiagonal matrix. The presence of $D^{-1}$
does not complicate matters because $D$ is a diagonal matrix.
Formula (9), which is needed anyway, may also be used for a check.


Example 2. Use the method of orthogonalization of rows to solve
the system of equations

$$\begin{aligned} 3.00x_1 + 0.15x_2 - 0.09x_3 &= 6.00, \\ 0.08x_1 + 4.00x_2 - 0.16x_3 &= 12.00, \\ 0.05x_1 + 0.30x_2 + 5.00x_3 &= 20.00 \end{aligned} \qquad \text{(I)}$$

Solution. Using the formulas of the preceding section, we
determine the multipliers:

$$\lambda_{21} = \frac{3.00 \cdot 0.08 + 0.15 \cdot 4.00 + (-0.09) \cdot (-0.16)}{3.00^2 + 0.15^2 + 0.09^2} = \frac{0.8544}{9.0306} = 0.0946,$$

$$\lambda_{31} = \frac{3.00 \cdot 0.05 + 0.15 \cdot 0.30 - 0.09 \cdot 5.00}{3.00^2 + 0.15^2 + 0.09^2} = -\frac{0.2550}{9.0306} = -0.0282$$

Retaining the first equation of system (I), subtract from each
subsequent equation the first multiplied by the corresponding
multipliers $\lambda_{i1}$ $(i = 2, 3)$:





$$\begin{aligned} 3.00x_1 + 0.15x_2 - 0.09x_3 &= 6.00, \\ -0.2038x_1 + 3.9858x_2 - 0.1685x_3 &= 11.4324, \\ 0.1346x_1 + 0.3042x_2 + 4.9975x_3 &= 20.1692 \end{aligned} \qquad \text{(II)}$$

For system (II) we determine the multiplier

$$\lambda_{32} = \frac{-0.2038 \cdot 0.1346 + 3.9858 \cdot 0.3042 - 0.1685 \cdot 4.9975}{0.2038^2 + 3.9858^2 + 0.1685^2} = \frac{0.3430}{15.9565} = 0.0215$$

Retaining the first two equations of system (II), subtract from
the third equation the second multiplied by $\lambda_{32}$:

$$\begin{aligned} 3.00x_1 + 0.15x_2 - 0.09x_3 &= 6.00, \\ -0.2038x_1 + 3.9858x_2 - 0.1685x_3 &= 11.4324, \\ 0.1390x_1 + 0.2185x_2 + 5.0011x_3 &= 19.9234 \end{aligned} \qquad \text{(III)}$$


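The matrix

$$R = \begin{bmatrix} 3.00 & 0.15 & -0.09 \\ -0.2038 & 3.9858 & -0.1685 \\ 0.1390 & 0.2185 & 5.0011 \end{bmatrix}$$

has orthogonal rows. As a check, form the matrix

$$D = RR' = \begin{bmatrix} 9.0306 & 0.0017 & -0.0002 \\ 0.0017 & 15.9565 & -0.0018 \\ -0.0002 & -0.0018 & 25.0780 \end{bmatrix} \approx \begin{bmatrix} 9.0306 & 0 & 0 \\ 0 & 15.9565 & 0 \\ 0 & 0 & 25.0780 \end{bmatrix}$$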


Using formula (8), we get

$$x = R'D^{-1}\beta = \begin{bmatrix} 3.00 & -0.2038 & 0.1390 \\ 0.15 & 3.9858 & 0.2185 \\ -0.09 & -0.1685 & 5.0011 \end{bmatrix} \begin{bmatrix} 0.1107 & 0 & 0 \\ 0 & 0.0626 & 0 \\ 0 & 0 & 0.0399 \end{bmatrix} \begin{bmatrix} 6.00 \\ 11.4324 \\ 19.9234 \end{bmatrix} = \begin{bmatrix} 1.957 \\ 3.126 \\ 3.803 \end{bmatrix}$$

Hence

$$x_1 = 1.957, \quad x_2 = 3.126, \quad x_3 = 3.803$$
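The second method can be sketched in Python with the data of Example 2. The function name is hypothetical; the right-hand side is transformed together with the rows, and the solution is then assembled from formula (8) with the diagonal matrix $D = RR'$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_by_row_orthogonalization(A, b):
    """Solve A x = b by reducing A to a matrix R with orthogonal rows."""
    n = len(A)
    R = [row[:] for row in A]
    beta = b[:]
    for k in range(n - 1):
        norm2 = dot(R[k], R[k])
        for i in range(k + 1, n):
            lam = dot(R[k], R[i]) / norm2
            R[i] = [a - lam * c for a, c in zip(R[i], R[k])]
            beta[i] -= lam * beta[k]         # same operation on the right-hand side
    # x = R' D^{-1} beta, where D = R R' is diagonal
    w = [beta[i] / dot(R[i], R[i]) for i in range(n)]
    return [sum(R[i][j] * w[i] for i in range(n)) for j in range(n)]

A = [[3.00, 0.15, -0.09],
     [0.08, 4.00, -0.16],
     [0.05, 0.30, 5.00]]
b = [6.00, 12.00, 20.00]
x = solve_by_row_orthogonalization(A, b)
residual = [dot(A[i], x) - b[i] for i in range(3)]
```

Carried out in double precision, the computation gives roughly $x \approx (1.958, 3.113, 3.794)$ with a residual at round-off level, so it differs in the later digits from the hand-rounded figures of the example.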


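C. THIRD METHOD (METHOD OF ORTHOGONAL MATRICES)

Suppose we have a linear system reduced to the form

$$Rx = \beta \qquad (10)$$

where $R = [r_{ij}]$ is a nonsingular matrix with orthogonal rows and

$$\beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_n \end{bmatrix}$$

is the vector of constant terms.
Multiplying each equation of the system (10) by the normalizing
factor

$$\mu_i = \frac{1}{\sqrt{\sum_{j=1}^{n} r_{ij}^2}} \quad (i = 1, 2, \ldots, n)$$

we get the system

$$\tilde{R}x = \tilde{\beta} \qquad (11)$$

where $\tilde{R} = [\mu_i r_{ij}]$ is an orthogonal matrix and

$$\tilde{\beta} = \begin{bmatrix} \mu_1 \beta_1 \\ \vdots \\ \mu_n \beta_n \end{bmatrix}$$

is the new vector of constant terms.
From equation (11) we have

$$x = \tilde{R}^{-1} \tilde{\beta} = \tilde{R}' \tilde{\beta} \qquad (12)$$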
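Formula (12) reduces the solution to a normalization followed by one multiplication by a transposed matrix. A minimal sketch, assuming a hypothetical 2x2 system whose rows are already orthogonal (the function name is illustrative):

```python
import math

def solve_orthogonal_rows(R, beta):
    """Solve R x = beta for a nonsingular R with pairwise orthogonal rows."""
    n = len(R)
    # normalizing factors mu_i = 1 / sqrt(sum_j r_ij^2)
    mu = [1.0 / math.sqrt(sum(r * r for r in row)) for row in R]
    R_tilde = [[mu[i] * R[i][j] for j in range(n)] for i in range(n)]
    beta_tilde = [mu[i] * beta[i] for i in range(n)]
    # x = R_tilde' beta_tilde, since R_tilde is orthogonal
    return [sum(R_tilde[i][j] * beta_tilde[i] for i in range(n))
            for j in range(n)]

R = [[3.0, 4.0],
     [-4.0, 3.0]]          # rows (3, 4) and (-4, 3) are orthogonal
beta = [11.0, 2.0]
x = solve_orthogonal_rows(R, beta)
```

For this system the solution is $x_1 = 1$, $x_2 = 2$, which can be confirmed by substitution into $3x_1 + 4x_2 = 11$ and $-4x_1 + 3x_2 = 2$.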




10.9 THE SOLUTION SPACE OF A HOMOGENEOUS SYSTEM 
We now consider a homogeneous linear system of equations

$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \ldots + a_{1n}x_n &= 0, \\ a_{21}x_1 + a_{22}x_2 + \ldots + a_{2n}x_n &= 0, \\ \ldots \quad & \\ a_{n1}x_1 + a_{n2}x_2 + \ldots + a_{nn}x_n &= 0 \end{aligned} \qquad (1)$$

or, briefly,

$$Ax = 0 \qquad (1')$$

where $A = [a_{ij}]$ and

$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}$$

is the desired vector.

If $\det A \neq 0$, then, by the Cramer rule, the system (1) has a
unique trivial solution $x = 0$.

Let $\det A = 0$. Then (1) has an infinite number of solutions
(including nontrivial solutions).

From formula (1') it follows that: (1) if $x$ is a solution of (1'),
then $cx$, where $c$ is an arbitrary constant, is also a solution of
this equation; (2) if $x^{(1)}$ and $x^{(2)}$ are solutions of (1'), then the
sum $x^{(1)} + x^{(2)}$ is also a solution of (1').

Indeed, if

$$Ax = 0$$

then

$$A(cx) = cAx = 0$$

Hence, $cx$ is a solution of the homogeneous system (1).
Similarly, if

$$Ax^{(1)} = 0 \quad \text{and} \quad Ax^{(2)} = 0$$

then

$$A(x^{(1)} + x^{(2)}) = Ax^{(1)} + Ax^{(2)} = 0 + 0 = 0$$

That is, $x^{(1)} + x^{(2)}$ is also a solution of the homogeneous system (1).

In this way we find that any linear combination of solutions of
the homogeneous system (1) is also a solution of the system. Hence,
the set of all solutions of (1) forms a vector space which we call
the solution space. The following theorem holds true.


Theorem. If $n$ is the number of unknowns of a homogeneous
system (1) and $r$ is the rank of its matrix $A$, then the dimensionality
of the solution space is $k = n - r$.


The basis of a solution space is termed the fundamental system
of solutions. If for system (1) the fundamental system of solutions

$$\begin{aligned} x^{(1)} &= (x_1^{(1)}, x_2^{(1)}, \ldots, x_n^{(1)}), \\ x^{(2)} &= (x_1^{(2)}, x_2^{(2)}, \ldots, x_n^{(2)}), \\ & \ldots \\ x^{(k)} &= (x_1^{(k)}, x_2^{(k)}, \ldots, x_n^{(k)}) \end{aligned}$$

is known, then all its solutions are contained in the formula

$$x = c_1 x^{(1)} + c_2 x^{(2)} + \ldots + c_k x^{(k)} \qquad (2)$$

or, written out,

$$\begin{aligned} x_1 &= c_1 x_1^{(1)} + c_2 x_1^{(2)} + \ldots + c_k x_1^{(k)}, \\ x_2 &= c_1 x_2^{(1)} + c_2 x_2^{(2)} + \ldots + c_k x_2^{(k)}, \\ & \ldots \\ x_n &= c_1 x_n^{(1)} + c_2 x_n^{(2)} + \ldots + c_k x_n^{(k)} \end{aligned}$$

where $c_1, c_2, \ldots, c_k$ are arbitrary constants.
To find the fundamental system of solutions, one isolates the
nonzero $r$th order minor $\delta_r$ in the matrix $A$. Suppose

$$\delta_r = \begin{vmatrix} a_{11} & a_{12} & \ldots & a_{1r} \\ a_{21} & a_{22} & \ldots & a_{2r} \\ \vdots & \vdots & & \vdots \\ a_{r1} & a_{r2} & \ldots & a_{rr} \end{vmatrix} \neq 0$$

This can always be done by interchanging the equations of system
(1) and altering the numbering of the unknowns. It is then easy
to prove that the equations of (1), beginning with the $(r+1)$th,
are consequences of the first $r$ equations of the system; that is to
say, they are satisfied if the first $r$ equations of system (1) are
satisfied. It is therefore sufficient to consider the subsystem

$$\begin{aligned} a_{11}x_1 + a_{12}x_2 + \ldots + a_{1r}x_r &= -a_{1,r+1}x_{r+1} - \ldots - a_{1n}x_n, \\ a_{21}x_1 + a_{22}x_2 + \ldots + a_{2r}x_r &= -a_{2,r+1}x_{r+1} - \ldots - a_{2n}x_n, \\ \ldots \quad & \\ a_{r1}x_1 + a_{r2}x_2 + \ldots + a_{rr}x_r &= -a_{r,r+1}x_{r+1} - \ldots - a_{rn}x_n \end{aligned} \qquad (3)$$

whose determinant $\delta_r$ is nonzero.
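In the system (3) the values of the unknowns

$$x_{r+1}, x_{r+2}, \ldots, x_n$$

may be considered arbitrary. Solving (3) for the unknowns
$x_1, x_2, \ldots, x_r$ we get

$$\begin{aligned} x_1 &= \alpha_{11}c_1 + \alpha_{12}c_2 + \ldots + \alpha_{1k}c_k, \\ x_2 &= \alpha_{21}c_1 + \alpha_{22}c_2 + \ldots + \alpha_{2k}c_k, \\ & \ldots \\ x_r &= \alpha_{r1}c_1 + \alpha_{r2}c_2 + \ldots + \alpha_{rk}c_k \end{aligned} \qquad (4)$$

where $\alpha_{ij}$ $(i = 1, 2, \ldots, r;\ j = 1, 2, \ldots, k)$ are quite definite constants.
Moreover,

$$x_{r+1} = c_1, \quad x_{r+2} = c_2, \quad \ldots, \quad x_n = c_k \qquad (4')$$

Formulas (4) and (4') yield the complete system of solutions of
system (1). For the fundamental system of solutions we can take

$$x^{(1)} = \begin{bmatrix} \alpha_{11} \\ \vdots \\ \alpha_{r1} \\ 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad x^{(2)} = \begin{bmatrix} \alpha_{12} \\ \vdots \\ \alpha_{r2} \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \quad \ldots, \quad x^{(k)} = \begin{bmatrix} \alpha_{1k} \\ \vdots \\ \alpha_{rk} \\ 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}$$

These solutions can be found directly from the system (3) if in (3)
we successively set

$$\begin{aligned} & x_{r+1} = 1, \quad x_{r+2} = \ldots = x_n = 0; \\ & x_{r+1} = 0, \quad x_{r+2} = 1, \quad x_{r+3} = \ldots = x_n = 0; \\ & \ldots \\ & x_{r+1} = \ldots = x_{n-1} = 0, \quad x_n = 1 \end{aligned}$$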








Example. Find the fundamental system of solutions of the following
system of homogeneous equations:

$$\begin{aligned} x_1 - x_2 + 5x_3 - x_4 &= 0, \\ x_1 + x_2 - 2x_3 + 3x_4 &= 0, \\ 3x_1 - x_2 + 8x_3 + x_4 &= 0, \\ x_1 + 3x_2 - 9x_3 + 7x_4 &= 0 \end{aligned} \qquad (5)$$

Solution. The rank of the matrix of system (5) is $r = 2$, and

$$\delta_2 = \begin{vmatrix} 1 & -1 \\ 1 & 1 \end{vmatrix} = 2 \neq 0$$

For this reason, the last two equations of (5) are consequences
of the first two. We solve the subsystem

$$\begin{aligned} x_1 - x_2 &= -5x_3 + x_4, \\ x_1 + x_2 &= 2x_3 - 3x_4 \end{aligned}$$

First assuming $x_3 = 1$, $x_4 = 0$, and then $x_3 = 0$ and $x_4 = 1$, we get
two linearly independent solutions

$$\begin{aligned} x^{(1)} &= \left( -\tfrac{3}{2}, \tfrac{7}{2}, 1, 0 \right), \\ x^{(2)} &= (-1, -2, 0, 1) \end{aligned}$$

which form the fundamental system of solutions of the system (5).

The vectors $x^{(1)}$ and $x^{(2)}$ form a basis of the solution space of
the given system, and all its solutions are determined by the
formulas

$$\begin{aligned} x_1 &= -3c_1 - c_2, \\ x_2 &= 7c_1 - 2c_2, \\ x_3 &= 2c_1, \\ x_4 &= c_2 \end{aligned}$$

where $c_1$ and $c_2$ are arbitrary constants (for the sake of convenience,
the first constant is taken in the form $2c_1$).


10.10 LINEAR TRANSFORMATIONS OF VARIABLES 


Suppose $x_1, x_2, \ldots, x_n$ are a set of variables and $y_1, y_2, \ldots, y_n$
are another set of variables connected with the first set by the
relations

$$\begin{aligned} y_1 &= f_1(x_1, x_2, \ldots, x_n), \\ y_2 &= f_2(x_1, x_2, \ldots, x_n), \\ & \ldots \\ y_n &= f_n(x_1, x_2, \ldots, x_n) \end{aligned} \qquad (1)$$

where $f_1, f_2, \ldots, f_n$ are given functions.
We will use the term transformation for a transition from the
set $x_1, x_2, \ldots, x_n$ to the set $y_1, y_2, \ldots, y_n$.


Definition. The transformation (1) is called linear if the new
variables $y_1, y_2, \ldots, y_n$ are homogeneous linear functions of the
old variables $x_1, x_2, \ldots, x_n$; that is,

$$\begin{aligned} y_1 &= a_{11}x_1 + a_{12}x_2 + \ldots + a_{1n}x_n, \\ y_2 &= a_{21}x_1 + a_{22}x_2 + \ldots + a_{2n}x_n, \\ & \ldots \\ y_n &= a_{n1}x_1 + a_{n2}x_2 + \ldots + a_{nn}x_n \end{aligned} \qquad (2)$$

where $a_{ij}$ $(i = 1, 2, \ldots, n;\ j = 1, 2, \ldots, n)$ are constants.
The linear transformation (2) is uniquely defined by its square
matrix of coefficients (transformation matrix)

$$A = \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{bmatrix}$$

Conversely, having a square matrix A= [a,,], one can construct 
a linear transformation for which this matrix will serve as the 
matrix of coefficients. There thus exists a one-to-one correspondence 
between linear transformations and square matrices. 


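The sets of variables $x_1, x_2, \ldots, x_n$ and $y_1, y_2, \ldots, y_n$ may be
regarded as column vectors

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \quad \text{and} \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

of the same vector space $E_n$. For this reason, the transformation
(2) maps the space $E_n$ onto itself or onto a proper subset of itself.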


Example 1. The transformation

$$\begin{aligned} y_1 &= x_1 + x_2, \\ y_2 &= x_1 + x_2 \end{aligned}$$

carries the space $E_2$ into the subspace $y_1 = y_2$ of dimensionality 1.
The relations (2) are equivalent to a single matrix relation,

$$y = Ax \qquad (3)$$

Indeed, by the rule for multiplying matrices we have

$$\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} a_{11}x_1 + a_{12}x_2 + \ldots + a_{1n}x_n \\ a_{21}x_1 + a_{22}x_2 + \ldots + a_{2n}x_n \\ \vdots \\ a_{n1}x_1 + a_{n2}x_2 + \ldots + a_{nn}x_n \end{bmatrix}$$

From this, via the notion of the equality of matrices, we get
formulas (2). By formula (3), matrix $A$ may be regarded as a
linear-transformation operator.

It is easy to see that a linear transformation has two basic
properties:

(1) a constant factor can be taken outside the sign of the
linear-transformation operator:

$$A(\alpha x) = \alpha Ax$$

(2) the linear-transformation operator of a sum of several vectors
is equal to the sum of the operators of these vectors; thus

$$A(x + z) = Ax + Az$$

As a consequence, we have

$$A(\alpha x + \beta z) = \alpha Ax + \beta Az$$

where $x$ and $z$ are vectors, $\alpha$ and $\beta$ are scalars.

Example 2. We have a plane $Ox_1x_2$. Suppose each vector

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

is associated with a vector

$$y = \begin{bmatrix} x_1 \\ 0 \end{bmatrix}$$

which is the projection of vector $x$ on the $x_1$-axis (projection
transformation) (Fig. 52). Show that the given transformation is a
linear transformation and find the transformation matrix.

Fig. 52

Solution. We obviously have

$$y_1 = x_1, \quad y_2 = 0$$

and consequently the projection transformation is linear. The
transformation matrix is

$$A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$$


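Let us investigate the meaning of the elements $a_{ij}$ of $A$. We
consider the unit vectors directed along the coordinate axes
$Ox_1, Ox_2, \ldots, Ox_n$:

$$e_1 = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad e_2 = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \quad \ldots, \quad e_n = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix}$$

Applying the transformation $A$ to the unit vector $e_j$, whose only
nonzero coordinate is a 1 in the $j$th position, we get

$$Ae_j = \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{bmatrix} \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} = \begin{bmatrix} a_{1j} \\ a_{2j} \\ \vdots \\ a_{nj} \end{bmatrix}$$

Thus, $a_{ij}$ is the $i$th coordinate of the transformed $j$th unit vector.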


Example 3. Suppose every radius vector $x$ in the $x_1x_2$-plane is
replaced by a radius vector $y$ of the same length turned through
an angle $\alpha$ from $x$ (rotation transformation) (Fig. 53).

Fig. 53

Show that the given transformation is linear and find the
transformation matrix.


Solution. Connect vector $y$ with a coordinate system $Oy_1y_2$, which
is turned through an angle $\alpha$ relative to the coordinate system $Ox_1x_2$.
Since the coordinates of the vector $y$ in the system $Oy_1y_2$ are
clearly $x_1$ and $x_2$, the coordinates of this vector in the old $Ox_1x_2$
system are given, by familiar formulas of analytic geometry, by

$$\begin{aligned} y_1 &= x_1 \cos\alpha - x_2 \sin\alpha, \\ y_2 &= x_1 \sin\alpha + x_2 \cos\alpha \end{aligned} \qquad (4)$$

The rotation transformation is thus linear and its matrix is of
the form

$$A = \begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix}$$

The problem can be solved differently: as a result of the
transformation, the unit vector $e_1$ will plainly go into the vector
$\{\cos\alpha, \sin\alpha\}$ and the unit vector $e_2$ will go into the vector
$\{-\sin\alpha, \cos\alpha\}$. Hence, by the foregoing, the transformation matrix is

$$A = \begin{bmatrix} \cos\alpha & -\sin\alpha \\ \sin\alpha & \cos\alpha \end{bmatrix}$$

and we again arrive at formulas (4).
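The rule that the columns of a transformation matrix are the images of the unit vectors is easy to check numerically for the rotation. A minimal sketch, assuming an arbitrary test angle and vector (the function names are illustrative):

```python
import math

def rotation_matrix(alpha):
    """Matrix of the rotation through the angle alpha, formulas (4)."""
    return [[math.cos(alpha), -math.sin(alpha)],
            [math.sin(alpha),  math.cos(alpha)]]

def apply(A, x):
    """Image y = A x of the vector x."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

alpha = math.pi / 6                  # hypothetical angle of 30 degrees
A = rotation_matrix(alpha)
image_e1 = apply(A, [1.0, 0.0])      # first column of A
image_e2 = apply(A, [0.0, 1.0])      # second column of A
y = apply(A, [3.0, 4.0])             # a vector of length 5
```

The images of $e_1$ and $e_2$ reproduce the columns $\{\cos\alpha, \sin\alpha\}$ and $\{-\sin\alpha, \cos\alpha\}$, and the length of `y` remains 5, as a rotation requires.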
We now define the following operations with the linear
transformations $A$ and $B$: (1) addition $A + B$, (2) multiplication by a
scalar $\alpha A$, (3) composition $BA$:

$$(A + B)x = Ax + Bx \qquad (5)$$
$$(\alpha A)x = \alpha (Ax) \qquad (5')$$
$$(BA)x = B(Ax) \qquad (5'')$$

where $x$ is a vector.

It can readily be shown that if $A$ and $B$ are linear
transformations, then the resulting transformations $A + B$, $\alpha A$, and
$BA$ are also linear.

The following principle holds true: each operation from (1) to (3)
involving linear transformations is associated with the same operation
on the matrices of these linear transformations; this is taken to
mean that the formulas (5), (5') and (5'') are to be regarded in
their literal meaning.

For the proof, we confine ourselves to the third case (5'')
[formulas (5) and (5') are verified more simply].

Suppose it is required to perform two linear transformations in
succession:

$$\begin{aligned} y_1 &= a_{11}x_1 + a_{12}x_2 + \ldots + a_{1n}x_n, \\ y_2 &= a_{21}x_1 + a_{22}x_2 + \ldots + a_{2n}x_n, \\ & \ldots \\ y_n &= a_{n1}x_1 + a_{n2}x_2 + \ldots + a_{nn}x_n \end{aligned} \qquad (6)$$

with matrix $A = [a_{ij}]$ and

$$\begin{aligned} z_1 &= b_{11}y_1 + b_{12}y_2 + \ldots + b_{1n}y_n, \\ z_2 &= b_{21}y_1 + b_{22}y_2 + \ldots + b_{2n}y_n, \\ & \ldots \\ z_n &= b_{n1}y_1 + b_{n2}y_2 + \ldots + b_{nn}y_n \end{aligned} \qquad (7)$$

with matrix $B = [b_{ij}]$.

Denote by $C = [c_{ij}]$ the matrix of the composition of these
transformations in the indicated order, that is, going from the variables
$x_1, x_2, \ldots, x_n$ to the variables $z_1, z_2, \ldots, z_n$. Writing the
transformations (6) and (7) compactly as

$$y_k = \sum_{j=1}^{n} a_{kj} x_j \quad (k = 1, 2, \ldots, n), \qquad (6')$$

$$z_i = \sum_{k=1}^{n} b_{ik} y_k \quad (i = 1, 2, \ldots, n) \qquad (7')$$

and substituting formula (6') into (7'), we get

$$z_i = \sum_{k=1}^{n} b_{ik} \sum_{j=1}^{n} a_{kj} x_j = \sum_{j=1}^{n} x_j \sum_{k=1}^{n} b_{ik} a_{kj} \qquad (8)$$

Thus, the coefficient of $x_j$ in the expression for $z_i$, or element
$c_{ij}$ of the matrix $C$, is of the form

$$c_{ij} = \sum_{k=1}^{n} b_{ik} a_{kj} = b_{i1}a_{1j} + b_{i2}a_{2j} + \ldots + b_{in}a_{nj}$$

We see that the element of $C$ in the $i$th row and $j$th column
is equal to the sum of the products of the corresponding elements
of the $i$th row of matrix $B$ and the $j$th column of matrix $A$, which
means it coincides with the corresponding element of the product
of matrix $B$ by matrix $A$. Hence, $C = BA$.

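The proof is much simpler in matrix notation. Suppose

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}, \quad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \quad \text{and} \quad z = \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix}$$

are the respective vectors. From formulas (6) and (7) we have

$$y = Ax \quad \text{and} \quad z = By$$

whence

$$z = B(Ax) = (BA)x$$

Consequently, the matrix of the resultant transformation is $C = BA$.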


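Example 4. Find the result of the following succession of linear
transformations:

$$\begin{aligned} y_1 &= 5x_1 - x_2 + 3x_3, \\ y_2 &= x_1 - 2x_2, \\ y_3 &= 7x_2 - x_3 \end{aligned}$$

and

$$\begin{aligned} z_1 &= 2y_1 + y_3, \\ z_2 &= y_2 - 5y_3, \\ z_3 &= 2y_2 \end{aligned}$$

Solution. Write the transformation matrices

$$A = \begin{bmatrix} 5 & -1 & 3 \\ 1 & -2 & 0 \\ 0 & 7 & -1 \end{bmatrix} \quad \text{and} \quad B = \begin{bmatrix} 2 & 0 & 1 \\ 0 & 1 & -5 \\ 0 & 2 & 0 \end{bmatrix}$$

Multiplying these matrices, we get the matrix of the resultant
transformation

$$C = BA = \begin{bmatrix} 2 & 0 & 1 \\ 0 & 1 & -5 \\ 0 & 2 & 0 \end{bmatrix} \begin{bmatrix} 5 & -1 & 3 \\ 1 & -2 & 0 \\ 0 & 7 & -1 \end{bmatrix} = \begin{bmatrix} 10 & 5 & 5 \\ 1 & -37 & 5 \\ 2 & -4 & 0 \end{bmatrix}$$

And so the desired linear transformation has the form

$$\begin{aligned} z_1 &= 10x_1 + 5x_2 + 5x_3, \\ z_2 &= x_1 - 37x_2 + 5x_3, \\ z_3 &= 2x_1 - 4x_2 \end{aligned}$$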
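The result of Example 4 can be checked by multiplying the matrices and by comparing $C x$ with $B(Ax)$ on a test vector. In the sketch below the helper names and the test vector are assumptions made for illustration.

```python
def mat_mul(B, A):
    """Product B A of two matrices given as lists of rows."""
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

def apply(M, x):
    """Image M x of the vector x."""
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

A = [[5, -1, 3],
     [1, -2, 0],
     [0, 7, -1]]
B = [[2, 0, 1],
     [0, 1, -5],
     [0, 2, 0]]
C = mat_mul(B, A)            # matrix of the composition: first A, then B

x = [1, 2, 3]                # arbitrary test vector
z_direct = apply(C, x)
z_stepwise = apply(B, apply(A, x))
```

Note the order of the factors: the transformation applied first stands on the right, so the composition "first $A$, then $B$" has the matrix $BA$, which in general differs from $AB$.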


10.11 INVERSE TRANSFORMATION 
Suppose a given transformation

$$\begin{aligned} y_1 &= a_{11}x_1 + a_{12}x_2 + \ldots + a_{1n}x_n, \\ y_2 &= a_{21}x_1 + a_{22}x_2 + \ldots + a_{2n}x_n, \\ & \ldots \\ y_n &= a_{n1}x_1 + a_{n2}x_2 + \ldots + a_{nn}x_n \end{aligned} \qquad (1)$$

carries a set of variables $x_1, x_2, \ldots, x_n$ into a set of variables
$y_1, y_2, \ldots, y_n$.

Definition. The transformation which carries a set of variables
$y_1, y_2, \ldots, y_n$ into the set of variables $x_1, x_2, \ldots, x_n$ is termed
an inverse transformation relative to the transformation (1).
We obtain the formulas of the inverse transformation by solving
system (1) for the variables $x_1, x_2, \ldots, x_n$.
Let $A_{ij}$ $(i, j = 1, 2, \ldots, n)$ be the cofactors of the elements $a_{ij}$
of matrix $A = [a_{ij}]$ and

$$\det A = \det [a_{ij}] = \Delta \neq 0$$

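Multiplying the equations of the system (1) by $A_{11}, A_{21}, \ldots, A_{n1}$,
respectively, and adding, we get, by a familiar formula,

$$A_{11}y_1 + A_{21}y_2 + \ldots + A_{n1}y_n = \Delta x_1$$

In similar fashion we derive

$$\begin{aligned} A_{12}y_1 + A_{22}y_2 + \ldots + A_{n2}y_n &= \Delta x_2, \\ \ldots \quad & \\ A_{1n}y_1 + A_{2n}y_2 + \ldots + A_{nn}y_n &= \Delta x_n \end{aligned}$$

whence

$$\begin{aligned} x_1 &= \frac{A_{11}}{\Delta} y_1 + \frac{A_{21}}{\Delta} y_2 + \ldots + \frac{A_{n1}}{\Delta} y_n, \\ & \ldots \\ x_n &= \frac{A_{1n}}{\Delta} y_1 + \frac{A_{2n}}{\Delta} y_2 + \ldots + \frac{A_{nn}}{\Delta} y_n \end{aligned} \qquad (2)$$

Thus, the inverse transformation of a linear transformation (if
it exists) is also linear.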

Theorem, A linear transformation has a unique inverse if and 
only if the matrix of the given transformation is nonsingular. The 
inverse of a linear transformation is linear and its matrix is the 
inverse of the matrix of the initial transformation. 


Proof. If $A = [a_{ij}]$ is the matrix of transformation (1) and
$\Delta = \det A \neq 0$, then the inverse exists and is defined by formulas (2).
The matrix of the inverse transformation is clearly equal to

$$\left[\frac{A_{ji}}{\Delta}\right] = A^{-1}$$

If $\Delta = 0$, then, by routine algebra, the equations (1) cannot be
uniquely solved for the variables $x_1, x_2, \dots, x_n$. Thus, a unique
inverse transformation does not exist, and there will definitely be
values of the variables $y_1, y_2, \dots, y_n$ for which there are no
corresponding values of the variables $x_1, x_2, \dots, x_n$. In this case,
the linear transformation is called singular (degenerate).
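As an illustration of formulas (2), the following minimal sketch builds the matrix of the inverse transformation entry by entry from cofactors, $[A_{ji}/\Delta]$, in exact rational arithmetic. The helper names (`minor`, `det`, `inverse_by_cofactors`, `matmul`) and the sample matrix are assumptions introduced here for illustration; cofactor expansion is practical only for small $n$.

```python
from fractions import Fraction

def minor(a, i, j):
    """Matrix a with row i and column j removed."""
    return [row[:j] + row[j + 1:] for k, row in enumerate(a) if k != i]

def det(a):
    """Determinant by cofactor expansion along the first row."""
    if not a:
        return 1  # determinant of the empty (0 x 0) matrix
    return sum((-1) ** j * a[0][j] * det(minor(a, 0, j)) for j in range(len(a)))

def inverse_by_cofactors(a):
    """Return [A_ji / delta], the matrix of the inverse transformation."""
    n = len(a)
    delta = det(a)
    if delta == 0:
        raise ValueError("singular transformation: det A = 0")
    # Entry (i, j) of the inverse is the cofactor A_ji divided by delta.
    return [[Fraction((-1) ** (i + j) * det(minor(a, j, i)), delta)
             for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Ordinary matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

A = [[2, 1, 1], [1, 2, 1], [1, 1, 2]]   # sample nonsingular matrix (assumed)
A_inv = inverse_by_cofactors(A)
print(matmul(A, A_inv))                 # the unit matrix, with Fraction entries
```

For a singular matrix such as a projection the function raises an error instead of returning a result, which mirrors the case $\Delta = 0$ of the theorem.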


Note 1. Write the transformation (1) in matrix form

$$y = Ax \tag{3}$$

where $A = [a_{ij}]$ is the transformation matrix and $x$ and $y$ are
column vectors.
If the transformation $A$ is nonsingular ($\det A \neq 0$), then the
inverse transformation

$$x = A^{-1}y \tag{4}$$

exists and each vector $x$ of the n-dimensional space $Ox_1x_2 \dots x_n$
is associated, by virtue of formula (3), with one and only one
vector $y$ of that space; thus, formula (3) transforms the space
$Ox_1x_2 \dots x_n$ into itself.

If the transformation $A$ is singular ($\det A = 0$), then (3) transforms
the space $Ox_1x_2 \dots x_n$ into a subspace of a smaller number of
dimensions.


Example. Consider the projection transformation (see Sec. 10.10,
Example 2) defined by the matrix

$$A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$$

Here $A$ is singular and the transformation $y = Ax$ carries the space
$Ox_1x_2$ into the coordinate axis $Ox_1$.


Note 2. We take $Ex$ to mean the identical transformation that
leaves the vector $x$ unchanged.
Since the relations

$$y = Ax \quad\text{and}\quad x = A^{-1}y$$

imply

$$y = AA^{-1}y \quad\text{and}\quad x = A^{-1}Ax$$

it follows that

$$AA^{-1} = A^{-1}A = E$$


10.12 EIGENVECTORS AND EIGENVALUES OF A MATRIX 


Given a square matrix $A = [a_{ij}]$. Consider the linear transformation

$$y = Ax \tag{1}$$

where $x$ and $y$ are n-dimensional vectors (column matrices) of,
generally speaking, a complex n-dimensional space.


Definition 1. A nonzero vector is called an eigenvector of a given
matrix (or of the linear transformation defined by it) if as a
result of the corresponding linear transformation the vector is carried
into a collinear vector; that is, if the transformed vector differs
from the original one only by a scalar factor.

In other words, the vector $x \neq 0$ is an eigenvector of matrix $A$
if the matrix carries $x$ into the vector

$$Ax = \lambda x \tag{2}$$

The scalar $\lambda$ in (2) is called an eigenvalue (or characteristic root,
characteristic number, latent root) of the matrix $A$ corresponding
to the given eigenvector $x$.


Example 1. Let us consider the projection transformation in two-
dimensional space $Ox_1x_2$ defined by the matrix

$$A = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$$

Here the eigenvectors are (1) the nonzero vectors $x$ directed along
the $x_1$-axis with eigenvalue $\lambda_1 = 1$ and (2) the nonzero vectors $y$
directed along the $x_2$-axis with eigenvalue $\lambda_2 = 0$ (Fig. 54).

Fig. 54

Theorem 1. Every linear transformation (matrix) in a complex vector
space has at least one real or complex eigenvector.

Proof. Let $A$ be the matrix of a linear transformation. The
eigenvectors of $A$ are nonzero solutions of the matrix equation

$$Ax = \lambda x$$

or

$$(A - \lambda E)x = 0 \tag{3}$$


where the matrix $A - \lambda E$ is called the characteristic matrix. Equation
(3) is a homogeneous linear system which has nontrivial solutions
if and only if the determinant of the system is zero; thus, the
following condition must be met:

$$\det (A - \lambda E) = 0 \tag{4}$$


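The determinant (4) is called the characteristic (secular) deter-
minant of matrix $A$, and equation (4) is called the characteristic
(secular) equation of matrix $A$. Expanded, the characteristic equa-
tion (4) looks like this:

$$
\begin{vmatrix}
a_{11} - \lambda & a_{12} & \dots & a_{1n}\\
a_{21} & a_{22} - \lambda & \dots & a_{2n}\\
\vdots & \vdots & & \vdots\\
a_{n1} & a_{n2} & \dots & a_{nn} - \lambda
\end{vmatrix} = 0
\tag{4'}
$$

or

$$\lambda^n - \sigma_1\lambda^{n-1} + \sigma_2\lambda^{n-2} - \dots + (-1)^{n-1}\sigma_{n-1}\lambda + (-1)^n\sigma_n = 0 \tag{5}$$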


The polynomial in the left member of (5) is called the charac-
teristic polynomial of matrix $A$. Its coefficients $\sigma_i$ $(i = 1, 2, \dots, n)$
are defined by the following rules. The coefficient $\sigma_1$ is equal to
the sum of the diagonal elements of matrix $A$, or
$\sigma_1 = \sum_{i=1}^{n} a_{ii}$. This number is called the trace of matrix $A$
and is denoted by $\sigma_1 = \operatorname{tr} A$.
The coefficient $\sigma_2$ is the sum of all diagonal second-order minors
of $A$. Generally, the coefficient $\sigma_k$ is the sum of all the diagonal
kth-order minors of $A$. Finally, the constant term $\sigma_n$ is equal
to the determinant of the matrix $A$:

$$\sigma_n = \det A$$
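The rules for the coefficients $\sigma_k$ can be checked numerically. The sketch below, in which the function names `det`, `sigma` and `char_poly` and the sample matrix are assumptions made for illustration, sums the diagonal (principal) kth-order minors and evaluates the left member of (5).

```python
from itertools import combinations

def det(a):
    """Determinant by cofactor expansion along the first row."""
    if not a:
        return 1  # empty minor, used for sigma_0 = 1
    return sum((-1) ** j * a[0][j] * det([row[:j] + row[j + 1:] for row in a[1:]])
               for j in range(len(a)))

def sigma(a, k):
    """Sum of all diagonal (principal) k-th order minors of a."""
    n = len(a)
    return sum(det([[a[i][j] for j in idx] for i in idx])
               for idx in combinations(range(n), k))

def char_poly(a, lam):
    """Left member of (5): lam^n - sigma_1 lam^(n-1) + ... + (-1)^n sigma_n."""
    n = len(a)
    return sum((-1) ** k * sigma(a, k) * lam ** (n - k) for k in range(n + 1))

A = [[2, 1, 1], [1, 2, 1], [1, 1, 2]]       # sample matrix (assumed)
print([sigma(A, k) for k in (1, 2, 3)])     # trace, second-order sum, determinant
print(char_poly(A, 1), char_poly(A, 4))     # both vanish at the eigenvalues
```

Enumerating all principal minors grows combinatorially with $n$, so this is a check of the definition rather than a practical way of forming the characteristic polynomial.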


The characteristic equation (5) is an algebraic equation of degree
$n$ in $\lambda$ and, hence, as is proved in algebra, has at least one real
or complex root. Let $\lambda_1, \lambda_2, \dots, \lambda_m$ $(m \le n)$ be distinct roots of
equation (5). These roots are called eigenvalues (or characteristic
numbers) of the matrix $A$, and the set of all eigenvalues is termed
the spectrum of the matrix $A$. Take some root $\lambda = \lambda_j$ and substitute
it into (3) to get $(A - \lambda_j E)x = 0$ or, in expanded form,

$$
\left.
\begin{aligned}
(a_{11} - \lambda_j)x_1 + a_{12}x_2 + \dots + a_{1n}x_n &= 0,\\
a_{21}x_1 + (a_{22} - \lambda_j)x_2 + \dots + a_{2n}x_n &= 0,\\
\dots\dots\dots\dots\dots\dots\dots\dots\dots\dots &\\
a_{n1}x_1 + a_{n2}x_2 + \dots + (a_{nn} - \lambda_j)x_n &= 0
\end{aligned}
\right\}
\tag{6}
$$

Since the determinant of system (6) $\det (A - \lambda_j E) = 0$, this system
definitely has nontrivial solutions, which are the eigenvectors of $A$
that correspond to the eigenvalue $\lambda_j$. If the rank of the matrix
$A - \lambda_j E$ is equal to $r$ $(r < n)$, then there are $k = n - r$ linearly
independent eigenvectors

$$x^{(1j)}, x^{(2j)}, \dots, x^{(kj)}$$

associated with the root $\lambda_j$. The theorem is proved.


Note. It may be proved that the number of linearly independent 
eigenvectors corresponding to one and the same root of the charac- 
teristic equation does not exceed the multiplicity of that root. 
One thing that this implies is that if the roots of the characteristic 
equation (5) are distinct, then each eigenvalue is associated with 
one and only one eigenvector to within the factor of proportionality. 


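Example 2. Find the eigenvalues and the eigenvectors of the matrix

$$A = \begin{bmatrix} 2 & 1 & 1\\ 1 & 2 & 1\\ 1 & 1 & 2 \end{bmatrix}$$

Solution. Set up the characteristic equation of the matrix $A$:

$$
\begin{vmatrix}
2 - \lambda & 1 & 1\\
1 & 2 - \lambda & 1\\
1 & 1 & 2 - \lambda
\end{vmatrix} = 0
$$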


whence $(\lambda - 1)^2 (4 - \lambda) = 0$ and $\lambda_1 = \lambda_2 = 1$, $\lambda_3 = 4$. Take $\lambda_1 = 1$ and
put it into the equation

$$(A - \lambda E)x = 0 \tag{7}$$

to get

$$
\begin{bmatrix} 1 & 1 & 1\\ 1 & 1 & 1\\ 1 & 1 & 1 \end{bmatrix}
\begin{bmatrix} x_1\\ x_2\\ x_3 \end{bmatrix} = 0
$$

or

$$
\left.
\begin{aligned}
x_1 + x_2 + x_3 &= 0,\\
x_1 + x_2 + x_3 &= 0,\\
x_1 + x_2 + x_3 &= 0
\end{aligned}
\right\}
\tag{8}
$$

Since the rank of the matrix of system (8) is $r = 1$, two of the equations
are consequences of the third (which, incidentally, is obvious). It
therefore suffices to solve the equation

$$x_1 + x_2 + x_3 = 0$$

Putting $x_1 = c_1$, $x_2 = c_2$, we get

$$x_3 = -(c_1 + c_2)$$

where $c_1$ and $c_2$ are arbitrary scalars not simultaneously zero.
In particular, first choosing $c_1 = 1$, $c_2 = 0$ and then $c_1 = 0$, $c_2 = 1$,
we will have the simplest fundamental set of solutions consisting
of two linearly independent eigenvectors of matrix $A$:

$$
x^{(1)} = \begin{bmatrix} 1\\ 0\\ -1 \end{bmatrix}
\quad\text{and}\quad
x^{(2)} = \begin{bmatrix} 0\\ 1\\ -1 \end{bmatrix}
$$

All the other eigenvectors of $A$ that correspond to the eigenvalue
$\lambda = 1$ are linear combinations of these basis vectors and fill the
plane spanned by the vectors $x^{(1)}$ and $x^{(2)}$ (with the exception of
the origin).

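Now take $\lambda_3 = 4$. Substituting this value into (7), we get

$$
\begin{bmatrix} -2 & 1 & 1\\ 1 & -2 & 1\\ 1 & 1 & -2 \end{bmatrix}
\begin{bmatrix} x_1\\ x_2\\ x_3 \end{bmatrix} = 0
$$

or

$$
\left.
\begin{aligned}
-2x_1 + x_2 + x_3 &= 0,\\
x_1 - 2x_2 + x_3 &= 0,\\
x_1 + x_2 - 2x_3 &= 0
\end{aligned}
\right\}
\tag{9}
$$

The rank of the system (9) is $r = 2$, and the upper left minor

$$\begin{vmatrix} -2 & 1\\ 1 & -2 \end{vmatrix} = 3 \neq 0$$

Thus, the third equation of the system is a consequence of the
first two equations, and so we can confine ourselves to a system
of the first two equations:

$$
\begin{aligned}
-2x_1 + x_2 + x_3 &= 0,\\
x_1 - 2x_2 + x_3 &= 0
\end{aligned}
$$

whence

$$
\frac{x_1}{\begin{vmatrix} 1 & 1\\ -2 & 1 \end{vmatrix}}
= \frac{-x_2}{\begin{vmatrix} -2 & 1\\ 1 & 1 \end{vmatrix}}
= \frac{x_3}{\begin{vmatrix} -2 & 1\\ 1 & -2 \end{vmatrix}}
$$

or

$$\frac{x_1}{3} = \frac{x_2}{3} = \frac{x_3}{3}; \quad\text{that is,}\quad x_1 = x_2 = x_3 = c$$

where $c$ is a constant different from zero.
Putting $c = 1$, we get the simplest solution, which yields the eigen-
vector of $A$

$$x^{(3)} = \begin{bmatrix} 1\\ 1\\ 1 \end{bmatrix}$$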
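The result of Example 2 is easy to confirm by direct substitution into $Ax = \lambda x$. The following sketch does this in exact integer arithmetic; the helper names `matvec` and `is_eigenpair` are assumptions introduced here.

```python
def matvec(a, x):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def is_eigenpair(a, lam, x):
    """True if x is nonzero and a x = lam x (exact for integer data)."""
    return any(x) and matvec(a, x) == [lam * xj for xj in x]

A = [[2, 1, 1], [1, 2, 1], [1, 1, 2]]

print(is_eigenpair(A, 1, [1, 0, -1]))   # x^(1) with eigenvalue 1
print(is_eigenpair(A, 1, [0, 1, -1]))   # x^(2) with eigenvalue 1
print(is_eigenpair(A, 4, [1, 1, 1]))    # x^(3) with eigenvalue 4

# A linear combination 2 x^(1) - 3 x^(2) stays in the eigenplane of lambda = 1.
print(is_eigenpair(A, 1, [2, -3, 1]))
```

The zero vector is rejected by construction, in keeping with Definition 1.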


Definition 2. A linear subspace $E_k$ $(k < n)$ is said to be invariant
with respect to a given linear transformation

$$y = Ax$$

if each transformed vector of this subspace also belongs to it;
thus, $x \in E_k$ implies $Ax \in E_k$.
The proof of Theorem 1 clearly holds if we consider, in some
invariant space, a linear transformation defined by matrix $A$.

Theorem 1'. Every linear transformation defined on an invariant
subspace of a complex vector space has at least one eigenvector.
We note one more important property of eigenvectors.


Theorem 2. The eigenvectors of a matrix which correspond to 
pairwise distinct eigenvalues are linearly independent. 


Proof. Let $A$ be a given matrix and

$$Ax^{(j)} = \lambda_j x^{(j)} \quad (j = 1, 2, \dots, m) \tag{10}$$

where

$$x^{(j)} \neq 0 \quad\text{and}\quad \lambda_j \neq \lambda_k \ \text{ for } j \neq k$$

Suppose that

$$c_1 x^{(1)} + c_2 x^{(2)} + \dots + c_m x^{(m)} = 0 \tag{11}$$

where $|c_1| + |c_2| + \dots + |c_m| \neq 0$.
For the sake of definiteness, let $c_1 \neq 0$. Applying the transfor-
mation $A$ to (11), we get, by virtue of formulas (10),

$$\lambda_1 c_1 x^{(1)} + \lambda_2 c_2 x^{(2)} + \dots + \lambda_m c_m x^{(m)} = 0 \tag{12}$$

Now, multiplying (11) by $\lambda_m$ and subtracting (12) from the
resulting equation, we find

$$(\lambda_m - \lambda_1) c_1 x^{(1)} + (\lambda_m - \lambda_2) c_2 x^{(2)} + \dots + (\lambda_m - \lambda_{m-1}) c_{m-1} x^{(m-1)} = 0 \tag{13}$$

Then, from (13) we can eliminate the vector $x^{(m-1)}$ in similar
fashion, and so on. Thus, eliminating the vectors

$$x^{(m)}, x^{(m-1)}, \dots, x^{(2)}$$

we get

$$(\lambda_m - \lambda_1)(\lambda_{m-1} - \lambda_1) \dots (\lambda_2 - \lambda_1)\, c_1 x^{(1)} = 0 \tag{14}$$

But this equation is impossible because not one of the factors
of the left-hand member is equal to zero. Thus, our assumption
is false and the eigenvectors $x^{(1)}, x^{(2)}, \dots, x^{(m)}$ are linearly inde-
pendent.
Corollary. If all the eigenvalues of a matrix $A$ of order $n$ are
pairwise distinct, then the corresponding eigenvectors of this
matrix ($n$ altogether¹) form a basis of an appropriate n-dimensional
space.

¹ It is assumed that one eigenvector is taken for each eigenvalue.


10.13 SIMILAR MATRICES


Definition. Two matrices associated with one and the same linear
transformation in different bases are called similar.

If a matrix $A$ is similar to a matrix $B$, we write $A \sim B$. Let
us derive a condition for the similarity of two matrices. Suppose,
in a certain basis, matrix $A$ represents the linear transformation

$$y = Ax \tag{1}$$

In a new basis (in new coordinates), this same linear transfor-
mation will be described by another matrix, $B$:

$$\eta = B\xi \tag{2}$$

where

$$B \sim A$$

Denote by $S$ the change-of-basis matrix from the new system
to the old system; thus, let

$$x = S\xi, \quad y = S\eta \tag{3}$$

where

$$\det S \neq 0$$

Putting formulas (3) into (1), we get

$$S\eta = AS\xi$$

Then, premultiplying the last equation by the inverse matrix $S^{-1}$,
we get

$$\eta = S^{-1}AS\xi \tag{4}$$

Comparing formulas (4) and (2), we obtain

$$B = S^{-1}AS \tag{5}$$


With regard to the matrices $A$ and $B$ connected by the rela-
tion (5), we say that the matrix $B$ is obtained from $A$ via a trans-
formation by means of matrix $S$. We thus conclude that two mat-
rices are similar if and only if one of them is obtained from the
other by a transformation effected by some nonsingular matrix.

From (5) we derive $A = SBS^{-1}$, which is to say, if matrix $B$
is similar to matrix $A$, then also, conversely, matrix $A$ is similar
to matrix $B$. The following are some properties of a transforma-
tion by means of a matrix $S$.

1. The transformation of a sum is equal to the sum of the trans-
formations:

$$S^{-1}(A + B)S = S^{-1}AS + S^{-1}BS$$

2. The transformation of a product is equal to the product of the
transformations of the factors:

$$S^{-1}(AB)S = S^{-1}AS \cdot S^{-1}BS$$

3. The transformation of an inverse matrix is equal to the inverse
of the transformed matrix:

$$S^{-1}A^{-1}S = (S^{-1}AS)^{-1}$$

4. The transformation of an integral power (positive or negative)
is equal to the same power of the transformation:

$$S^{-1}A^{n}S = (S^{-1}AS)^{n}$$

Theorem 1. Similar matrices have the same characteristic polyno-
mials.

Proof. Let $B \sim A$. It is required to prove that

$$\det (A - \lambda E) = \det (B - \lambda E)$$

Since

$$B = S^{-1}AS \quad (\det S \neq 0)$$

we have $B - \lambda E = S^{-1}AS - \lambda S^{-1}S = S^{-1}(A - \lambda E)S$, and then

$$\det (B - \lambda E) = \det [S^{-1}(A - \lambda E)S] = \det S^{-1} \det (A - \lambda E) \det S = \det (A - \lambda E)\,^{1}$$

Thus

$$\det (B - \lambda E) = \det (A - \lambda E)$$


Corollary 1. Similar matrices have identical traces and the same 
eigenvalues (including their multiplicities). 


Corollary 2. The property of a vector to be an eigenvector of
a given linear transformation is independent of the choice of basis.
Indeed, suppose

$$Ax = \lambda x \quad (x \neq 0)$$

If in the new basis the vector $x$ is equivalent to the vector $\xi$,
then we have

$$x = S\xi$$

where $S$ is the change-of-basis matrix.

From this we have $AS\xi = \lambda S\xi$ and, hence, $S^{-1}AS\xi = \lambda\xi$, which
means that $\xi$ is an eigenvector of the matrix $B = S^{-1}AS \sim A$ that
describes our linear transformation in the new basis.


Note. Since the characteristic polynomial, the eigenvalues and 
the eigenvectors are the same for all matrices representing a given 
linear transformation, they are called, respectively, the characte- 
ristic polynomial, eigenvalues and eigenvectors of the linear trans- 
formation itself. 


Theorem 2. If a given square matrix of order $n$ has $n$ linearly
independent eigenvectors, then, taking them for basis vectors, we
obtain a diagonal matrix similar to the given one.

Proof. Suppose we have a square matrix $A$. Form a basis from
the eigenvectors $e_1, e_2, \dots, e_n$. Since $e_j$ are eigenvectors, it fol-
lows that

$$Ae_j = \lambda_j e_j \quad (j = 1, 2, \dots, n)$$


¹ Here we make use of familiar theorems (see Secs. 7.2 and 7.4): (1) the
determinant of the product of two square matrices of the same order is equal
to the product of the determinants of the matrices; (2) the determinant of an
inverse matrix is equal to the reciprocal of the determinant of the original
matrix.


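Consider any vector $x$ of our space. Decomposing it into its
basis vectors $e_j$ $(j = 1, 2, \dots, n)$, we have

$$x = \sum_{j=1}^{n} x_j e_j$$

where $x_j$ are the coordinates of the vector $x$ in the given basis.
Applying the transformation $A$ to the vector $x$, we get a new
vector

$$y = Ax = A \sum_{j=1}^{n} x_j e_j$$

or, since the transformation $A$ is linear,

$$y = \sum_{j=1}^{n} x_j A e_j = \sum_{j=1}^{n} \lambda_j x_j e_j$$

whence we see that the coordinates of the vector $y$ in the given
basis are

$$y_j = \lambda_j x_j \quad (j = 1, 2, \dots, n)$$

or

$$y_j = \sum_{k=1}^{n} \delta_{jk} \lambda_k x_k$$

where $\delta_{jk}$ is the Kronecker delta.
Thus, in the new basis the transformation matrix is the diago-
nal matrix

$$\Lambda = [\delta_{jk} \lambda_k]$$

or, expanded,

$$
\Lambda = \begin{bmatrix}
\lambda_1 & 0 & 0 & \dots & 0\\
0 & \lambda_2 & 0 & \dots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
0 & 0 & 0 & \dots & \lambda_n
\end{bmatrix}
$$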

Corollary. Any square matrix whose eigenvalues are pairwise 
distinct can be reduced to diagonal form by means of a similarity 
transformation. 

This result follows directly from Theorem 2 of the preceding
section.
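The reduction to diagonal form can be traced on a small case. The sketch below takes an assumed sample matrix with distinct eigenvalues 2 and 5, places its eigenvectors in the columns of $S$, and forms $S^{-1}AS$ in exact arithmetic; the names `inverse`, `matmul`, `trace` and the sample data are illustrative assumptions, not part of the text.

```python
from fractions import Fraction

def matmul(a, b):
    """Ordinary matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse(m):
    """Inverse by Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(m)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def trace(m):
    return sum(m[i][i] for i in range(len(m)))

A = [[4, 1], [2, 3]]        # sample matrix with eigenvalues 2 and 5 (assumed)
S = [[1, 1], [-2, 1]]       # columns are eigenvectors (1, -2) and (1, 1)

B = matmul(inverse(S), matmul(A, S))
print(B)                    # diagonal matrix carrying the eigenvalues 2 and 5
print(trace(A) == trace(B)) # similar matrices have identical traces
```

The order of the eigenvalues on the diagonal follows the order in which the eigenvectors are placed in the columns of $S$.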


10.14 BILINEAR FORM OF A MATRIX


Let $A = [a_{jk}]$ be a real square matrix and let $x$, $y$ be vectors
in an n-dimensional complex space. Form the scalar product

$$(Ax, y) = \sum_{j=1}^{n} \Big( \sum_{k=1}^{n} a_{jk} x_k \Big) y_j^* = \sum_{j=1}^{n} \sum_{k=1}^{n} a_{jk} x_k y_j^* \tag{1}$$

Expression (1) is called a bilinear form of matrix $A$.

Let us derive an important property of a bilinear form. The
sum (1) will plainly have the same value if we change the order
of summation and at the same time interchange the summation
indices. We therefore get

$$(Ax, y) = \sum_{j=1}^{n} \sum_{k=1}^{n} a_{kj} x_j y_k^*$$

Let us write this sum as a scalar product:

$$(Ax, y) = \sum_{j=1}^{n} \sum_{k=1}^{n} a_{kj} x_j y_k^* = \Big( \sum_{j=1}^{n} \sum_{k=1}^{n} a_{kj} y_k x_j^* \Big)^* = (A'y, x)^* = (x, A'y)$$

Thus,

$$(Ax, y) = (x, A'y) \tag{2}$$

This means that in the scalar product (1) the real matrix $A$ may
be moved from first to second position if we replace it by its trans-
pose.


Corollary. If matrix $A$ is real and symmetric ($A' = A$), then

$$(Ax, y) = (x, Ay) \tag{3}$$

In a scalar product, a real symmetric matrix may be moved from
first position to second.
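Relation (2) can be tested on complex vectors. In the sketch below the scalar product is taken as $(x, y) = \sum_j x_j y_j^*$, the convention used in the proofs of this chapter; the function names and the sample data are assumptions chosen so that all arithmetic is exact.

```python
def matvec(a, x):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def inner(x, y):
    """Scalar product (x, y) = sum of x_j times the conjugate of y_j."""
    return sum(xj * yj.conjugate() for xj, yj in zip(x, y))

A = [[1, 2], [3, 4]]          # real, not symmetric (sample data)
x = [1 + 2j, 3 - 1j]
y = [2 - 1j, 1j]

left = inner(matvec(A, x), y)              # (Ax, y)
right = inner(x, matvec(transpose(A), y))  # (x, A'y)
print(left, right)                         # the two values coincide
```

For this non-symmetric $A$ the matrix cannot simply be moved across without transposing it: `inner(x, matvec(A, y))` gives a different number.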


10.15 PROPERTIES OF SYMMETRIC MATRICES


Theorem 1. All the eigenvalues of a symmetric matrix with real
elements are real.

Proof. Let $\lambda$ be an eigenvalue of matrix $A$ and $x$ the correspon-
ding eigenvector; thus,

$$Ax = \lambda x \quad (x \neq 0) \tag{1}$$

Since $A' = A$, then

$$(Ax, x) = (x, Ax)$$

or, by (1),

$$(\lambda x, x) = (x, \lambda x)$$

whence

$$\lambda (x, x) = \lambda^* (x, x)$$

The eigenvector is nonzero by definition, and so

$$(x, x) \neq 0$$

and, consequently, $\lambda = \lambda^*$, or $\lambda$ is a real number.


Corollary. The characteristic equation of a symmetric real matrix
has only real roots.


Theorem 2. The eigenvectors of a symmetric real matrix that 
correspond to distinct eigenvalues are orthogonal among themselves. 


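Proof. Let $A$ be a symmetric real matrix. Consider two eigen-
vectors $x^{(j)}$ and $x^{(k)}$ associated with the eigenvalues $\lambda_j$ and
$\lambda_k$ $(\lambda_j \neq \lambda_k)$. We have

$$Ax^{(j)} = \lambda_j x^{(j)} \tag{2}$$

and

$$Ax^{(k)} = \lambda_k x^{(k)} \tag{3}$$

Form the scalar product

$$(Ax^{(j)}, x^{(k)}) = (x^{(j)}, Ax^{(k)})$$

From this, by (2) and (3), we get

$$(\lambda_j x^{(j)}, x^{(k)}) = (x^{(j)}, \lambda_k x^{(k)})$$

and

$$\lambda_j (x^{(j)}, x^{(k)}) = \lambda_k^* (x^{(j)}, x^{(k)}) \tag{4}$$

Since on the basis of Theorem 1 the eigenvalue $\lambda_k$ is real, it
follows that $\lambda_k^* = \lambda_k$.

Hence, from (4), we have

$$(\lambda_j - \lambda_k)(x^{(j)}, x^{(k)}) = 0$$

But

$$\lambda_j - \lambda_k \neq 0$$

and so

$$(x^{(j)}, x^{(k)}) = 0$$

or the eigenvectors $x^{(j)}$ and $x^{(k)}$ are orthogonal among themselves.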


Note. The eigenvectors of a symmetric matrix with real elements 
may be assumed to be real. 


Theorem 3. Any real symmetric matrix can be reduced to diago-
nal form by means of a similarity transformation.

Proof. For the sake of clarity, let us confine our proof
to three-dimensional space, $E_3$.

Suppose we have a symmetric matrix $A$ of order three. As we
know, every matrix has at least one eigenvector (see Sec. 10.12,
Theorem 1). Denote by $e_1$ an eigenvector of matrix $A$ and by $\lambda_1$
the corresponding eigenvalue. Since $A$ is symmetric, the eigenvector
$e_1$ can be chosen real.
Consider all vectors $x$ that are orthogonal to $e_1$, that is,
such that

$$(x, e_1) = 0 \tag{5}$$

We will show that they form an invariant subspace $E_2$ with res-
pect to the transformation $A$ (Fig. 55).

Fig. 55

First of all, if $x \in E_2$ and $y \in E_2$, that is,

$$(x, e_1) = (y, e_1) = 0$$

then for arbitrary scalars $\alpha$ and $\beta$ we have

$$(\alpha x + \beta y, e_1) = \alpha (x, e_1) + \beta (y, e_1) = 0$$

and, consequently,

$$\alpha x + \beta y \in E_2$$

Thus, the set of vectors satisfying Condition (5) forms a linear
space, and it is easy to see that this space is two-dimensional.
Now let $x \in E_2$. Consider the scalar product

$$(Ax, e_1) = (x, Ae_1) = (x, \lambda_1 e_1) = \lambda_1 (x, e_1) = 0$$

Thus,

$$Ax \in E_2$$


By Theorem 1' (see Sec. 10.12), there also exists in the two-
dimensional space $E_2$ an eigenvector $e_2$ of matrix $A$. Now consi-
der the vectors $x$ that are orthogonal both to $e_1$ and to $e_2$,
that is, such that

$$(x, e_1) = (x, e_2) = 0$$

Similarly it is proved that these vectors form a one-di-
mensional space $E_1$ that is invariant under the transformation $A$.
Again, space $E_1$ has an eigenvector $e_3$ of matrix $A$. The eigenvec-
tors $e_1, e_2, e_3$ are pairwise orthogonal and so are linearly inde-
pendent. We thus construct an orthogonal basis for the space $E_3$
consisting of the eigenvectors of the matrix $A$.

Denote by $\lambda_j$ the eigenvalues associated with the eigenvectors $e_j$.
By Theorem 2 of Sec. 10.13, the matrix $\Lambda$ of the given linear
transformation will be diagonal in the proper basis $e_1, e_2, e_3$ and,
in our case,

$$
\Lambda = \begin{bmatrix}
\lambda_1 & 0 & 0\\
0 & \lambda_2 & 0\\
0 & 0 & \lambda_3
\end{bmatrix}
$$

The proof of the theorem in the general case is analogous.


Corollary 1. For every linear transformation with a real sym- 
metric matrix there is an orthogonal basis (consisting of real 
eigenvectors of the given matrix) in which the transformation 
matrix is diagonal. 


Corollary 2. If the matrix is symmetric, then every eigenvalue 
is associated with as many linearly independent eigenvectors as is 
the multiplicity of the eigenvalue. 


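Theorem 4 (extremal property of eigenvalues).
Let $A$ be a real symmetric matrix and

$$
\begin{aligned}
\lambda &= \min (\lambda_1, \lambda_2, \dots, \lambda_n),\\
\Lambda &= \max (\lambda_1, \lambda_2, \dots, \lambda_n)
\end{aligned}
$$

where $\lambda_1, \lambda_2, \dots, \lambda_n$ are all the eigenvalues of $A$.
Then the following inequality is valid for any vector $x$:

$$\lambda (x, x) \le (Ax, x) \le \Lambda (x, x) \tag{6}$$

Proof. By virtue of Corollary 1 of Theorem 3, there is a set of
eigenvectors $e_1, e_2, \dots, e_n$ of the matrix $A$:

$$Ae_j = \lambda_j e_j \quad (j = 1, 2, \dots, n)$$

which form an orthonormal basis of the space $E_n$. Then any vec-
tor $x$ can be represented in the form

$$x = x_1 e_1 + x_2 e_2 + \dots + x_n e_n$$

where $x_1, x_2, \dots, x_n$ are the coordinates of vector $x$ in the given
basis, whence

$$Ax = x_1 Ae_1 + x_2 Ae_2 + \dots + x_n Ae_n = \lambda_1 x_1 e_1 + \lambda_2 x_2 e_2 + \dots + \lambda_n x_n e_n$$

Taking into account the orthogonality of the vectors of the basis,
we have

$$(Ax, x) = \Big( \sum_{j=1}^{n} \lambda_j x_j e_j, \ \sum_{k=1}^{n} x_k e_k \Big) = \sum_{j=1}^{n} \lambda_j x_j x_j^*$$

or

$$(Ax, x) = \sum_{j=1}^{n} \lambda_j |x_j|^2 \tag{7}$$

Replacing $\lambda_j$ by the least value $\lambda$ in (7), we obtain

$$(Ax, x) \ge \lambda \sum_{j=1}^{n} |x_j|^2 = \lambda (x, x)$$

Similarly, putting in (7) the greatest value $\Lambda$ in place of $\lambda_j$,
we find

$$(Ax, x) \le \Lambda \sum_{j=1}^{n} |x_j|^2 = \Lambda (x, x)$$

And inequality (6) is proved.

Corollary. The minimal eigenvalue $\lambda$ and the maximal eigenvalue
$\Lambda$ of the symmetric real matrix $A$ are, respectively, the smallest
and largest values of the quadratic form

$$u = (Ax, x)$$

on a unit sphere $(x, x) = 1$.
Indeed, putting $(x, x) = 1$ in inequality (6), we have

$$\lambda \le (Ax, x) \le \Lambda$$

Moreover, if $Ax = \lambda x$, then

$$(Ax, x) = (\lambda x, x) = \lambda$$

Similarly, if $Ax = \Lambda x$, then

$$(Ax, x) = (\Lambda x, x) = \Lambda$$

Thus,

$$\lambda = \min (Ax, x) \ \text{ for } (x, x) = 1$$

and

$$\Lambda = \max (Ax, x) \ \text{ for } (x, x) = 1$$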
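Inequality (6) can be spot-checked on the matrix of Example 2 of Sec. 10.12, whose eigenvalues are 1, 1 and 4. The sketch below tests every integer vector in a small cube; this is an illustrative check on assumed sample vectors and not a proof, and the function names are introduced here.

```python
from itertools import product

def matvec(a, x):
    """Product of a matrix (list of rows) and a vector."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def dot(x, y):
    """Scalar product of two real vectors."""
    return sum(xj * yj for xj, yj in zip(x, y))

def quad_form(a, x):
    """The quadratic form u = (Ax, x)."""
    return dot(matvec(a, x), x)

def within_bounds(a, lam_min, lam_max, x):
    """Check lam_min (x, x) <= (Ax, x) <= lam_max (x, x)."""
    xx = dot(x, x)
    return lam_min * xx <= quad_form(a, x) <= lam_max * xx

A = [[2, 1, 1], [1, 2, 1], [1, 1, 2]]
LAM_MIN, LAM_MAX = 1, 4     # smallest and largest eigenvalues of A

all_ok = all(within_bounds(A, LAM_MIN, LAM_MAX, x)
             for x in product(range(-3, 4), repeat=3))
print(all_ok)

# The bounds are attained on eigenvectors.
print(quad_form(A, (1, 0, -1)), LAM_MIN * dot((1, 0, -1), (1, 0, -1)))
print(quad_form(A, (1, 1, 1)), LAM_MAX * dot((1, 1, 1), (1, 1, 1)))
```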


A symmetric real matrix $A = [a_{jk}]$ will be called positive definite
if the corresponding quadratic form

$$u = (Ax, x) = \sum_{j=1}^{n} \sum_{k=1}^{n} a_{jk} x_j x_k$$

is positive definite (see Sec. 8.13), that is, for any vector $x \neq 0$
we have

$$(Ax, x) > 0$$

Theorem 5. A symmetric real matrix is positive definite if and
only if all its eigenvalues are positive.

Proof. If $A$ is a symmetric real matrix and its eigenvalues $\lambda_j$
are such that $\lambda_j > 0$ $(j = 1, 2, \dots, n)$, then on the basis of for-
mula (7) we have, from the proof of the preceding theorem,

$$(Ax, x) = \sum_{j=1}^{n} \lambda_j |x_j|^2$$

where $x = (x_1, x_2, \dots, x_n)$, whence for $x \neq 0$ we have

$$(Ax, x) > 0$$

Thus, the matrix $A$ is positive definite.
Conversely, let $A$ be a positive definite symmetric real matrix.
By Theorem 1, all its eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$ are real, and

$$\lambda = \min (\lambda_1, \lambda_2, \dots, \lambda_n)$$

is the smallest value of the quadratic form $u = (Ax, x)$ on the
sphere $(x, x) = 1$. Since the sphere $(x, x) = 1$ is a compact boun-
ded set and the quadratic form $u$ is continuous and positive on
this sphere, then by the Weierstrass theorem, there is a smallest
value of $u$ and it is positive; that is,

$$\lambda > 0$$

whence, all the more so,

$$\lambda_j > 0 \ \text{ for } j = 1, 2, \dots, n$$
We give without proof the conditions of a positive definite real
matrix [2].

Theorem 6. For a real matrix $A = [a_{jk}]$, where $a_{jk} = a_{kj}$, to be po-
sitive definite, it is necessary and sufficient that the following Syl-
vester conditions be fulfilled:

$$
\Delta_1 = a_{11} > 0, \quad
\Delta_2 = \begin{vmatrix} a_{11} & a_{12}\\ a_{21} & a_{22} \end{vmatrix} > 0, \quad \dots, \quad
\Delta_n = \begin{vmatrix}
a_{11} & a_{12} & \dots & a_{1n}\\
a_{21} & a_{22} & \dots & a_{2n}\\
\vdots & \vdots & & \vdots\\
a_{n1} & a_{n2} & \dots & a_{nn}
\end{vmatrix} > 0
$$

Thus, a symmetric real matrix $A$ is positive definite if and only
if all the principal diagonal minors of its determinant $\det A$ are
strictly positive.
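The Sylvester conditions translate directly into a test on the leading minors $\Delta_1, \dots, \Delta_n$. The sketch below assumes the input is a real symmetric matrix (it does not verify symmetry); the function names and the two sample matrices are illustrative assumptions.

```python
def det(a):
    """Determinant by cofactor expansion along the first row."""
    if not a:
        return 1
    return sum((-1) ** j * a[0][j] * det([row[:j] + row[j + 1:] for row in a[1:]])
               for j in range(len(a)))

def leading_minors(a):
    """The minors Delta_1, ..., Delta_n cut from the upper left corner."""
    return [det([row[:k] for row in a[:k]]) for k in range(1, len(a) + 1)]

def is_positive_definite(a):
    """Sylvester test; a is assumed to be real and symmetric."""
    return all(d > 0 for d in leading_minors(a))

A = [[2, 1, 1], [1, 2, 1], [1, 1, 2]]   # eigenvalues 1, 1, 4
B = [[1, 2], [2, 1]]                    # eigenvalues 3 and -1

print(leading_minors(A), is_positive_definite(A))
print(leading_minors(B), is_positive_definite(B))
```

The outcome agrees with Theorem 5: the first matrix has only positive eigenvalues, while the second has a negative one.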


*10.16 PROPERTIES OF MATRICES WITH REAL ELEMENTS


In the sequel we will, as a rule, consider matrices $A = [a_{ij}]$
whose elements $a_{ij}$ are real. These are called real matrices.
Let $A = [a_{ij}]$ be a real square matrix of order $n$. Since its cha-
racteristic equation

$$\det (A - \lambda E) = 0$$

is a polynomial equation with real coefficients, the roots
$\lambda_1, \lambda_2, \dots, \lambda_n$ of the characteristic equation, which are the
eigenvalues of the matrix $A$, are conjugate in pairs if they are
complex (see Sec. 5.1); that is, if $\lambda_j$ is an eigenvalue of $A$, then
the conjugate $\lambda_j^*$ is also an eigenvalue of the matrix $A$ and is of
the same multiplicity.

A real matrix may not have real eigenvalues. However, in one 
important case when the elements of the matrix are positive, the 
existence of at least one real eigenvalue is guaranteed [6]. 


Perron's theorem. If all the elements of a square matrix are
positive, then the numerically largest eigenvalue is also positive and
is a simple root of the characteristic equation of the matrix; it is
associated with an eigenvector with positive coordinates.

Eigenvectors of a real matrix A with distinct eigenvalues are 
complex in the general case and do not possess the property of 
orthogonality. However, by invoking the eigenvectors of the transpose 
A’ we can obtain the so-called biorthogonality relations, which for 
the case of a symmetric matrix are equivalent to the ordinary re- 
lations of orthogonality. 


Theorem 1. If a matrix $A$ is real and its eigenvalues are pairwise
distinct, then there exist two bases, $\{x_j\}$ and $\{x'_j\}$, of the space $E_n$
consisting respectively of the eigenvectors of matrix $A$ and the eigen-
vectors of the transpose $A'$ which satisfy the following conditions of
biorthonormalization:

$$
(x_j, x'_k) = \delta_{jk} =
\begin{cases}
0 & \text{for } j \neq k,\\
1 & \text{for } j = k
\end{cases}
$$

Proof. Let $\lambda_1, \lambda_2, \dots, \lambda_n$ be the eigenvalues of the matrix $A$.
Since $A$ is real, we know that its eigenvalues are conjugate in
pairs, which means that besides the eigenvalue $\lambda_j$ the conjugate $\lambda_j^*$
is also an eigenvalue of $A$. Denote by $x_j$ $(j = 1, 2, \dots, n)$ the
corresponding eigenvectors of matrix $A$, that is,

$$Ax_j = \lambda_j x_j \quad (j = 1, 2, \dots, n) \tag{1}$$

The eigenvectors $\{x_j\}$ form a basis for the space $E_n$ (Sec. 10.12,
Theorem 2, Corollary).

Since a determinant remains unaltered under an interchange of
rows and columns,

$$\det (A' - \lambda E) = \det (A - \lambda E)$$

and, consequently, the transpose $A'$ has the same eigenvalues $\lambda_j$
as the matrix $A$. Let $x'_j$ $(j = 1, 2, \dots, n)$ be the eigenvectors of
the matrix $A'$ which correspond to the conjugate eigenvalues $\lambda_j^*$,
that is,

$$A'x'_j = \lambda_j^* x'_j \quad (j = 1, 2, \dots, n) \tag{2}$$

The eigenvectors $\{x'_j\}$ also form a basis for the space $E_n$.
The bases $\{x_j\}$ and $\{x'_j\}$ are biorthogonal, namely:

$$(x_j, x'_k) = 0 \ \text{ for } j \neq k \tag{3}$$

Indeed, on the one hand, we have

$$(Ax_j, x'_k) = (\lambda_j x_j, x'_k) = \lambda_j (x_j, x'_k) \tag{4}$$

On the other, taking into consideration that $A$ is real, we get

$$(Ax_j, x'_k) = (x_j, A'x'_k) = (x_j, \lambda_k^* x'_k) = \lambda_k (x_j, x'_k) \tag{5}$$

From (4) and (5) we derive

$$(\lambda_j - \lambda_k)(x_j, x'_k) = 0 \tag{6}$$

Since $\lambda_j \neq \lambda_k$ when $j \neq k$, from (6) follows (3). We will show that
the eigenvectors $\{x_j\}$ and $\{x'_j\}$ may be normalized so that

$$(x_j, x'_j) = 1 \quad (j = 1, 2, \dots, n) \tag{7}$$

Indeed, decomposing $x_j$ in terms of the eigenvectors of the basis
$\{x'_1, x'_2, \dots, x'_n\}$, we have

$$x_j = c_1 x'_1 + \dots + c_j x'_j + \dots + c_n x'_n$$

whence, taking into account the biorthogonality condition (3), we
get

$$(x_j, x_j) = c_1^* (x_j, x'_1) + \dots + c_j^* (x_j, x'_j) + \dots + c_n^* (x_j, x'_n) = c_j^* (x_j, x'_j) > 0$$

and so

$$(x_j, x'_j) = \alpha_j \neq 0$$

Taking the eigenvectors $x_j$ and $\dfrac{1}{\alpha_j^*} x'_j$ in place of $x_j$ and $x'_j$,
we get the required normalization (7) since

$$\Big( x_j, \frac{1}{\alpha_j^*} x'_j \Big) = \frac{1}{\alpha_j} (x_j, x'_j) = 1 \quad (j = 1, 2, \dots, n)$$

Thus, if the eigenvalues of the real matrix $A$ are distinct, then
for the proper basis $\{x_j\}$ of matrix $A$ it is always possible to find
a proper basis $\{x'_j\}$ of the transpose $A'$ such that

$$(x_j, x'_k) = \delta_{jk} \tag{8}$$

where $\delta_{jk}$ is the Kronecker delta.


Corollary. If the matrix $A$ is real and symmetric ($A' = A$), then
we can put $x'_j = x_j$ $(j = 1, 2, \dots, n)$ where $x_j$ are the normalized
eigenvectors of $A$ (see Sec. 10.15). Then

$$(x_j, x_k) = \delta_{jk}$$


Let us now derive the so-called bilinear expansion of matrix A. 


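Theorem 2. Let $A$ be a real square matrix and let

$$X_j = \begin{bmatrix} x_{1j}\\ \vdots\\ x_{nj} \end{bmatrix}$$

$(j = 1, 2, \dots, n)$ be its eigenvectors regarded as column matrices
and let

$$X'_k = [x'_{1k} \ \dots \ x'_{nk}]$$

$(k = 1, 2, \dots, n)$ be the corresponding¹ eigenvectors of the trans-
pose $A'$ regarded as row matrices; also let the conditions of biortho-
gonality (8) be met:

$$(X_j, X'_k) = X'_k X_j = \delta_{jk} \tag{9}$$

Then the following relation is valid:

$$A = \lambda_1 X_1 X'_1 + \lambda_2 X_2 X'_2 + \dots + \lambda_n X_n X'_n \tag{10}$$

where $\lambda_1, \lambda_2, \dots, \lambda_n$ are the eigenvalues of matrix $A$.

¹ That is, corresponding to the same eigenvalues of the matrices $A$ and $A'$.

Proof. We consider the matrices

$$
X = \begin{bmatrix}
x_{11} & x_{12} & \dots & x_{1n}\\
\vdots & \vdots & & \vdots\\
x_{n1} & x_{n2} & \dots & x_{nn}
\end{bmatrix},
\quad
X' = \begin{bmatrix}
x'_{11} & x'_{21} & \dots & x'_{n1}\\
\vdots & \vdots & & \vdots\\
x'_{1n} & x'_{2n} & \dots & x'_{nn}
\end{bmatrix}
$$

which consist of the columns $X_j$ $(j = 1, \dots, n)$ and rows $X'_k$
$(k = 1, \dots, n)$, respectively.
By (9) we have

$$X'X = \Big[ \sum_{i=1}^{n} x'_{ik} x_{ij} \Big] = [X'_k X_j] = [\delta_{jk}] = E \tag{11}$$

where $E$ is the unit matrix. Since the matrix $X$ consists of line-
arly independent columns, it is nonsingular, that is, $\det X \neq 0$
and, hence, there is an inverse $X^{-1}$. On the basis of (11) (see
Sec. 7.4, Theorem, Note 1) we have

$$X' = X^{-1}$$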


which implies that

$$XX' = E$$

and, thus, we obtain the second set of biorthogonality relations [7]:

$$\sum_{i=1}^{n} x_{ji} x'_{ki} = \delta_{jk} \tag{12}$$

Using these relations, we have

$$X_1 X'_1 + X_2 X'_2 + \dots + X_n X'_n = [x_{j1} x'_{k1}] + [x_{j2} x'_{k2}] + \dots + [x_{jn} x'_{kn}] = \Big[ \sum_{i=1}^{n} x_{ji} x'_{ki} \Big] = [\delta_{jk}] = E$$

That is,

$$E = X_1 X'_1 + X_2 X'_2 + \dots + X_n X'_n$$

Premultiplying this equation by $A$ and taking into account that

$$AX_j = \lambda_j X_j \quad (j = 1, 2, \dots, n)$$

we will clearly get equation (10).

It will be well to note that in formula (10), $X_j$ and $X'_j$ $(j = 1,
2, \dots, n)$ are the eigenvectors of the matrices $A$ and $A'$ corres-
ponding to one and the same eigenvalue $\lambda_j$, despite the notations
of Theorem 1, where $x_j$ and $x'_j$ are the eigenvectors of the matri-
ces $A$ and $A'$ which correspond to the complex con-
jugate eigenvalues $\lambda_j$ and $\lambda_j^*$.
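The bilinear expansion (10) can be followed on a small non-symmetric case. In the sketch below the sample matrix, its eigenvectors, and all function names are assumptions made for illustration: the eigenvectors of $A'$ are first scaled so that $X'_j X_j = 1$, and the sum $\sum_j \lambda_j X_j X'_j$ is then formed in exact arithmetic.

```python
from fractions import Fraction

def dot(u, v):
    """Row-times-column product of two real vectors."""
    return sum(a * b for a, b in zip(u, v))

def outer(col, row):
    """Column matrix times row matrix, an n x n matrix."""
    return [[c * r for r in row] for c in col]

def expansion(eigs, cols, rows):
    """Sum of lam_j * X_j X'_j over all j."""
    n = len(cols)
    total = [[0] * n for _ in range(n)]
    for lam, col, row in zip(eigs, cols, rows):
        term = outer(col, row)
        total = [[t + lam * e for t, e in zip(trow, erow)]
                 for trow, erow in zip(total, term)]
    return total

A = [[4, 1], [2, 3]]            # sample real matrix, eigenvalues 2 and 5
eigenvalues = [2, 5]
X = [[1, -2], [1, 1]]           # eigenvectors of A, one list per X_j
Xp_raw = [[1, -1], [2, 1]]      # eigenvectors of A' for the same eigenvalues

# Scale each X'_j so that X'_j X_j = 1 (biorthonormalization).
Xp = [[Fraction(c, dot(xp, x)) for c in xp] for x, xp in zip(X, Xp_raw)]

print(expansion(eigenvalues, X, Xp))    # reproduces A
print(expansion([1, 1], X, Xp))         # sum of X_j X'_j gives the unit matrix
```

Replacing all eigenvalues by 1 reproduces the intermediate identity $E = X_1X'_1 + \dots + X_nX'_n$ used in the proof.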


REFERENCES FOR CHAPTER 10 


[1] G. E. Shilov, Introduction to the Theory of Linear Spaces, 1952, Chapters
I-IX (in Russian).

[2] I. M. Gelfand, Lectures on Linear Algebra, 1951, Chapters l-I
(in Russian).

[3] A. I. Maltsev, Principles of Linear Algebra, 1948, Chapters I-II
(in Russian).

[4] Alston S. Householder, Principles of Numerical Analysis, 1953, Chapter 2.

[5] Yu. A. Shreider, The Solution of Systems of Linear Algebraic Equations,
1951 (in Russian).

[6] F. R. Gantmacher, The Theory of Matrices, 1953, Chapter VIII (in Russian);
there is an English translation published by Chelsea, New York, 1959.

[7] V. N. Faddeyeva, Computational Methods of Linear Algebra, 1950, Chapter I
(in Russian).


*Chapter 11 


ADDITIONAL FACTS ABOUT THE CONVERGENCE 
OF ITERATION PROCESSES FOR SYSTEMS 
OF LINEAR EQUATIONS 


11.1 THE CONVERGENCE OF MATRIX POWER SERIES 


Theorem 1. The matrix power series

$$\sum_{k=0}^{\infty} a_k X^k \tag{1}$$

with numerical coefficients $a_k$ converges if all eigenvalues
$\lambda_1, \lambda_2, \dots, \lambda_n$ of matrix $X$ are located in the closed circle of
convergence $|x| \le R$ (Fig. 56) of the scalar power series

$$\sum_{k=0}^{\infty} a_k x^k \tag{2}$$

$(x = \xi + i\eta)$, and the eigenvalues on the circumference of the circle
of convergence are simple and are points of convergence of the
series (2).
Series (1) diverges if even one eigenvalue of matrix $X$ lies outside
the closed circle of convergence of series (2) or if there is an
eigenvalue of matrix $X$ lying on the circumference of the circle of
convergence for which series (2) diverges.

Fig. 56


Proof. (1) Let the matrix $X$ be such that

$$|\lambda_1| \le R, \ \dots, \ |\lambda_n| \le R$$

For the sake of simplicity we assume that the eigenvalues $\lambda_j$
$(j = 1, 2, \dots, n)$ of matrix $X$ are simple. Then the matrix $X$ can
be reduced to diagonal form with the aid of the nonsingular
matrix $S$:

$$X = S^{-1} [\lambda_1, \dots, \lambda_n] S$$

Introducing the notation

$$F_m(X) = \sum_{k=0}^{m} a_k X^k, \quad f_m(x) = \sum_{k=0}^{m} a_k x^k$$

and

$$f(x) = \lim_{m \to \infty} f_m(x) = \sum_{k=0}^{\infty} a_k x^k$$

we have

$$
F_m(X) = \sum_{k=0}^{m} a_k \big( S^{-1} [\lambda_1, \dots, \lambda_n] S \big)^k
= \sum_{k=0}^{m} a_k S^{-1} [\lambda_1^k, \dots, \lambda_n^k] S
= S^{-1} [f_m(\lambda_1), \dots, f_m(\lambda_n)] S
\tag{3}
$$

Since the eigenvalues $\lambda_j$ lie within the circle of convergence of the
power series (2) or coincide with the points of convergence of that
series belonging to the circumference of the circle of convergence,
there exist finite limits

$$f(\lambda_j) = \lim_{m \to \infty} f_m(\lambda_j) \quad (j = 1, 2, \dots, n)$$

For this reason, passing to the limit in (3) as $m \to \infty$, we obtain

$$F(X) = \lim_{m \to \infty} F_m(X) = S^{-1} [f(\lambda_1), \dots, f(\lambda_n)] S \tag{4}$$

Thus, the matrix series (1) converges at the point $X$.

It can likewise be proved that the assertion of the theorem is
also true for the case of multiple eigenvalues $\lambda_j$, but we will not
examine this case [1].

(2) Suppose, for instance, that at least one eigenvalue $\lambda_s$ of
matrix $X$ is such that

$$|\lambda_s| > R$$

Since $\lambda_s$ lies outside the circle of convergence of the power series
(2), it follows that $f_m(\lambda_s)$ has no limit as $m \to \infty$. Formula (3)
implies that $F_m(X)$ likewise has no limit as $m \to \infty$; thus, series
(1) diverges at the point $X$.

A similar result is obtained if $|\lambda_s| = R$ and the series
$\sum_{k=0}^{\infty} a_k \lambda_s^k$ diverges.


Note. Formula (4) implies that if $\lambda_1, \dots, \lambda_n$ are simple eigen-
values of matrix $X$, then $f(\lambda_1), f(\lambda_2), \dots, f(\lambda_n)$, where

$$f(x) = \sum_{k=0}^{\infty} a_k x^k$$

are eigenvalues of the function

$$F(X) = \sum_{k=0}^{\infty} a_k X^k$$

In particular, the numbers $\lambda_1^k, \dots, \lambda_n^k$ are eigenvalues of the
matrix $X^k$.
Theorem 2. The geometric series of matrices

$$E + X + X^2 + \ldots + X^k + \ldots \qquad (5)$$

where X is a square matrix of order n, converges if and only if
all the eigenvalues

$$\lambda_j = \lambda_j(X) \quad (j = 1, 2, \ldots, n)$$

of X are located within the unit circle

$$|\lambda_j| < 1 \qquad (6)$$

If the series (5) converges, then $X^k \to 0$ as $k \to \infty$.


Proof. Indeed, since for the appropriate power series

$$\sum_{k=0}^{\infty} x^k \qquad (7)$$

the radius of convergence is R = 1, and (7) diverges for |x| = 1,
it follows by Theorem 1 that the geometric series (5) converges only when
Conditions (6) hold true.

If the series (5) converges, then

$$|\lambda_j| < 1 \quad (j = 1, 2, \ldots, n)$$

whence, assuming for the sake of simplicity that the eigenvalues
$\lambda_1, \ldots, \lambda_n$ are distinct, we have

$$X = S^{-1}\,[\lambda_1, \ldots, \lambda_n]\,S$$

where S is a nonsingular matrix. Therefore

$$X^k = S^{-1}\,[\lambda_1^k, \ldots, \lambda_n^k]\,S$$

and so $X^k \to 0$ as $k \to \infty$. This assertion also holds true for multiple
eigenvalues $\lambda_j$ (we omit this part of the proof).
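
Theorem 2 can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the matrix X is an illustrative choice and does not come from the text. It also uses the standard fact (not stated above) that, when all eigenvalues lie inside the unit circle, the sum of the geometric series equals $(E - X)^{-1}$.

```python
import numpy as np

def geometric_series_partial_sum(x, terms):
    """Return E + X + X^2 + ... + X^(terms-1)."""
    n = x.shape[0]
    total = np.zeros((n, n))
    power = np.eye(n)              # X^0 = E
    for _ in range(terms):
        total += power
        power = power @ x          # next power of X
    return total

# Illustrative matrix (an assumption, not taken from the text).
X = np.array([[0.5, 0.2],
              [0.1, 0.3]])

spectral_radius = max(abs(np.linalg.eigvals(X)))   # largest |lambda_j|
partial = geometric_series_partial_sum(X, 200)
limit = np.linalg.inv(np.eye(2) - X)               # sum of the series when it converges
high_power = np.linalg.matrix_power(X, 200)        # should be close to the zero matrix
```

Here both eigenvalues are smaller than unity in modulus, so the partial sums settle down and the high power of X is negligibly small.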


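Theorem 3. The modulus of each eigenvalue $\lambda_1, \ldots, \lambda_n$ of the
square matrix X does not exceed any canonical norm of it:

$$|\lambda_j| \le \|X\| \quad (j = 1, 2, \ldots, n)$$

Proof. Put

$$\|X\| = \rho$$

and consider the matrix

$$Y = \frac{1}{\rho + \varepsilon}\,X \qquad (8)$$

where $\varepsilon > 0$. Obviously

$$\|Y\| = \frac{1}{\rho + \varepsilon}\,\|X\| = \frac{\rho}{\rho + \varepsilon} < 1$$

Hence (see Sec. 7.10, Theorem 5) the series

$$E + Y + Y^2 + \ldots + Y^k + \ldots$$

converges.
From this we conclude, by Theorem 2, that the eigenvalues
$\mu_1, \ldots, \mu_n$ of matrix Y satisfy the inequalities

$$|\mu_j| < 1 \quad (j = 1, 2, \ldots, n)$$

But from formula (8) it follows that

$$\mu_j = \frac{\lambda_j}{\rho + \varepsilon} \quad (j = 1, 2, \ldots, n)$$

Consequently

$$|\lambda_j| < \rho + \varepsilon \quad (j = 1, 2, \ldots, n)$$

or, because the number $\varepsilon$ is arbitrary,

$$|\lambda_j| \le \rho = \|X\| \quad (j = 1, 2, \ldots, n)$$

which completes the proof.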
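
The bound of Theorem 3 is easy to observe numerically. The sketch below assumes NumPy; the test matrix is illustrative, and treating the maximum row sum, the maximum column sum and the square root of the sum of squares as the "canonical norms" of the text is an assumption made here.

```python
import numpy as np

# Illustrative matrix (an assumption, not taken from the text).
X = np.array([[1.0, -2.0, 0.5],
              [0.3, 4.0, -1.0],
              [2.0, 0.0, -3.0]])

spectral_radius = max(abs(np.linalg.eigvals(X)))   # max |lambda_j|

norm_rows = np.linalg.norm(X, np.inf)   # maximum absolute row sum
norm_cols = np.linalg.norm(X, 1)        # maximum absolute column sum
norm_sq = np.linalg.norm(X, 'fro')      # square root of the sum of squared entries
```

Each of the three norms bounds the modulus of every eigenvalue from above, although the bounds may be far from sharp.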


11.2 THE CAYLEY-HAMILTON THEOREM 


Theorem. Every square matrix X is a root of its characteristic
polynomial; thus, if

$$p(\lambda) = \lambda^n + p_1 \lambda^{n-1} + \ldots + p_n$$

where $p(\lambda) = \det(\lambda E - X)$, then

$$p(X) = X^n + p_1 X^{n-1} + \ldots + p_n E = 0$$

Proof. Let all the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ of matrix X, that
is, the roots of the characteristic equation $p(\lambda) = 0$, be distinct.
Then X may be reduced to diagonal form with the aid of some
nonsingular matrix S:

$$X = S^{-1}\,[\lambda_1, \lambda_2, \ldots, \lambda_n]\,S$$

Since p(X) is a particular case of a matrix power series, we have,
by formula (4) of Sec. 11.1,

$$p(X) = S^{-1}\,[p(\lambda_1), p(\lambda_2), \ldots, p(\lambda_n)]\,S$$

But, clearly,

$$p(\lambda_j) = 0 \quad (j = 1, 2, \ldots, n)$$

and so

$$p(X) = S^{-1}\,[0, 0, \ldots, 0]\,S = 0$$

If the characteristic equation $p(\lambda) = 0$ has multiple roots, then
they may be regarded as the limits of noncoincident roots under
infinitesimal perturbations of the coefficients of the equation [1].
As a result, the theorem can be generalized to this case as well.
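
The Cayley-Hamilton theorem can be verified numerically for a concrete matrix. This is a minimal sketch, assuming NumPy; it relies on `np.poly`, which returns the coefficients $1, p_1, \ldots, p_n$ of $\det(\lambda E - X)$, and evaluates the polynomial at X by Horner's scheme. The matrix is an illustrative choice.

```python
import numpy as np

def char_poly_at_matrix(x):
    """Evaluate p(X) = X^n + p_1 X^(n-1) + ... + p_n E by Horner's scheme."""
    coeffs = np.poly(x)            # [1, p_1, ..., p_n] for det(lambda*E - X)
    n = x.shape[0]
    result = np.zeros((n, n))
    for c in coeffs:
        result = result @ x + c * np.eye(n)
    return result

# Illustrative symmetric matrix (an assumption, not taken from this section).
X = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])

residual = char_poly_at_matrix(X)  # should vanish up to rounding error
```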


11.3 NECESSARY AND SUFFICIENT CONDITIONS FOR THE 
CONVERGENCE OF THE PROCESS OF ITERATION 
FOR A SYSTEM OF LINEAR EQUATIONS 


Using the eigenvalues of the matrix $\alpha = [\alpha_{ij}]$, it is possible to
specify necessary and sufficient conditions for the convergence of
the iteration process for a linear system

$$x = \alpha x + \beta \qquad (1)$$

where

$$\beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_n \end{bmatrix}, \quad x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}$$


Theorem. For the convergence of the iteration process

$$x^{(k)} = \alpha x^{(k-1)} + \beta \quad (k = 1, 2, \ldots) \qquad (2)$$

for any choice of the initial vector $x^{(0)}$ and for any constant term $\beta$,
it is necessary and sufficient that the eigenvalues of the matrix $\alpha$,
that is, the roots of the characteristic equation

$$\det(\alpha - \lambda E) = 0$$

be less than unity in modulus.


Proof. From formula (2) we get

$$x^{(k)} = (E + \alpha + \alpha^2 + \ldots + \alpha^{k-1})\,\beta + \alpha^k x^{(0)} \qquad (3)$$

whence it follows that the convergence of the iteration process (2)
for arbitrary $\beta$ and $x^{(0)}$ is equivalent to the convergence of the
geometric series of matrices

$$E + \alpha + \alpha^2 + \ldots = \sum_{k=0}^{\infty} \alpha^k \qquad (4)$$

By Theorem 2 of Sec. 11.1, the geometric series (4) converges if
all eigenvalues $\lambda_j$ (j = 1, 2, ..., n) of the matrix $\alpha$ satisfy the
inequalities

$$|\lambda_j| < 1 \quad (j = 1, 2, \ldots, n) \qquad (5)$$

Since, in that case, $\alpha^k \to 0$ as $k \to \infty$, from formula (3) it follows
that the process of iteration converges for arbitrary $\beta$ and $x^{(0)}$,
that is, there exists a limit

$$\lim_{k \to \infty} x^{(k)} = x$$

where x is clearly a solution of system (1).

If inequalities (5) are not valid, then the series (4) diverges.
In that case the iteration process will obviously diverge too for a
certain choice of the initial vector $x^{(0)}$.

Thus, for convergence of the process of iteration (2), it is necessary
and sufficient that all the roots $\lambda_1, \lambda_2, \ldots, \lambda_n$ of the
characteristic equation

$$\begin{vmatrix} \alpha_{11} - \lambda & \alpha_{12} & \ldots & \alpha_{1n} \\ \alpha_{21} & \alpha_{22} - \lambda & \ldots & \alpha_{2n} \\ \vdots & \vdots & & \vdots \\ \alpha_{n1} & \alpha_{n2} & \ldots & \alpha_{nn} - \lambda \end{vmatrix} = 0$$

satisfy the conditions $|\lambda_j| < 1$ (j = 1, 2, ..., n).


Corollary. For convergence of the process of iteration (2) it
is sufficient that

$$\|\alpha\| < 1 \qquad (6)$$

no matter what the canonical norm (cf. Sec. 9.1).


Indeed, we obtain inequalities (5) by virtue of Theorem 3 of
Sec. 11.1 and on the basis of inequalities (6).
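
The theorem and its corollary can be illustrated with a short computation. The sketch below assumes NumPy; the matrix `alpha`, the vector `beta` and the starting vector are illustrative values, not data from the text.

```python
import numpy as np

def simple_iteration(alpha, beta, x0, steps):
    """Apply x^(k) = alpha x^(k-1) + beta a fixed number of times."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = alpha @ x + beta
    return x

# Illustrative data (assumptions, not taken from the text).
alpha = np.array([[0.2, -0.3],
                  [0.4, 0.1]])
beta = np.array([1.0, 2.0])

rho = max(abs(np.linalg.eigvals(alpha)))       # largest eigenvalue modulus
row_norm = np.linalg.norm(alpha, np.inf)       # sufficient test of the corollary
x_iter = simple_iteration(alpha, beta, [10.0, -7.0], 200)
x_exact = np.linalg.solve(np.eye(2) - alpha, beta)   # solution of x = alpha x + beta
```

In this example the norm test of the corollary already guarantees convergence; the eigenvalue test is sharper but costs more to evaluate.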


Note. Consider the linear system

$$Ax = b \qquad (7)$$

where $A = [a_{ij}]$ and $b = [b_1, \ldots, b_n]^T$ is a column vector.
Suppose

$$D = \begin{bmatrix} a_{11} & 0 & \ldots & 0 \\ 0 & a_{22} & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & a_{nn} \end{bmatrix}, \quad \det D \ne 0$$

To reduce (7) to the special form (1) we ordinarily set

$$A = D + (A - D)$$

whence

$$Dx = b - (A - D)x$$

and since $\det D = a_{11} a_{22} \ldots a_{nn} \ne 0$, then

$$x = D^{-1} b + D^{-1}(D - A)x$$

We can take

$$\alpha = D^{-1}(D - A)$$

Thus, for the convergence of an ordinary iteration process for the
linear system (7) given any constant term b and any initial vector
$x^{(0)}$, it is necessary and sufficient that all the roots $\lambda_1, \lambda_2, \ldots, \lambda_n$
of the characteristic equation

$$\det[D^{-1}(D - A) - \lambda E] = 0 \qquad (8)$$

be less than unity in modulus. Taking advantage of the theorem
on the determinant of a product of two matrices, equation (8)
may be transformed as follows:

$$\det D^{-1} \det[(D - A) - \lambda D] = 0$$

or

$$\det[\lambda D + (A - D)] = 0$$

That is,

$$\begin{vmatrix} a_{11}\lambda & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22}\lambda & \ldots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn}\lambda \end{vmatrix} = 0$$


11.4 NECESSARY AND SUFFICIENT CONDITIONS 
FOR THE CONVERGENCE OF THE SEIDEL PROCESS 
FOR A SYSTEM OF LINEAR EQUATIONS 


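Given the linear system

$$x = \alpha x + \beta \qquad (1)$$

where $\alpha = [\alpha_{ij}]$ and $\beta = [\beta_1, \ldots, \beta_n]^T$, consider the Seidel process

$$x_i^{(k)} = \sum_{j=1}^{i-1} \alpha_{ij} x_j^{(k)} + \sum_{j=i}^{n} \alpha_{ij} x_j^{(k-1)} + \beta_i \quad (i = 1, 2, \ldots, n;\ k = 1, 2, \ldots)$$

for an arbitrary initial vector

$$x^{(0)} = \begin{bmatrix} x_1^{(0)} \\ \vdots \\ x_n^{(0)} \end{bmatrix}$$

Set

$$\alpha = B + C$$

where

$$B = \begin{bmatrix} 0 & 0 & \ldots & 0 & 0 \\ \alpha_{21} & 0 & \ldots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ \alpha_{n1} & \alpha_{n2} & \ldots & \alpha_{n,n-1} & 0 \end{bmatrix}, \quad C = \begin{bmatrix} \alpha_{11} & \alpha_{12} & \ldots & \alpha_{1n} \\ 0 & \alpha_{22} & \ldots & \alpha_{2n} \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & \alpha_{nn} \end{bmatrix}$$

Then the Seidel process can be written in matrix form as follows:

$$x^{(k)} = B x^{(k)} + C x^{(k-1)} + \beta \quad (k = 1, 2, \ldots) \qquad (2)$$


Theorem. For the convergence of the Seidel process (2) for system
(1) given an arbitrary choice of constant term $\beta$ and initial vector $x^{(0)}$,
it is necessary and sufficient that all the roots $\lambda_1, \ldots, \lambda_n$ of the
equation

$$\det[C - (E - B)\lambda] = \begin{vmatrix} \alpha_{11} - \lambda & \alpha_{12} & \ldots & \alpha_{1n} \\ \alpha_{21}\lambda & \alpha_{22} - \lambda & \ldots & \alpha_{2n} \\ \vdots & \vdots & & \vdots \\ \alpha_{n1}\lambda & \alpha_{n2}\lambda & \ldots & \alpha_{nn} - \lambda \end{vmatrix} = 0 \qquad (3)$$

be less than unity in modulus.


Proof. From formula (2) it follows that

$$(E - B)\,x^{(k)} = C x^{(k-1)} + \beta \qquad (4)$$

The matrix E - B is nonsingular, since

$$\det(E - B) = 1$$

and so (4) can be written as

$$x^{(k)} = (E - B)^{-1} C x^{(k-1)} + (E - B)^{-1} \beta \qquad (5)$$

It is then clear that the Seidel process is equivalent to the process
of simple iteration as applied to the linear system

$$x = (E - B)^{-1} C x + (E - B)^{-1} \beta$$

By virtue of the theorem of the preceding section, for convergence
of the iteration process (5) it is necessary and sufficient that the
roots $\lambda_1, \ldots, \lambda_n$ of the characteristic equation

$$\det[(E - B)^{-1} C - \lambda E] = 0 \qquad (6)$$

satisfy the conditions

$$|\lambda_j| < 1 \quad (j = 1, 2, \ldots, n)$$

Equation (6) is plainly equivalent to equation (3).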
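Note. Let

$$Ax = b \qquad (7)$$

Set

$$D = \begin{bmatrix} a_{11} & 0 & \ldots & 0 \\ 0 & a_{22} & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & a_{nn} \end{bmatrix}$$

In order to apply the Seidel method to (7), we ordinarily write
it in the form

$$Dx = (D - A)x + b$$

or

$$x = D^{-1}(D - A)x + D^{-1} b \qquad (8)$$

Set

$$A - D = B_1 + C_1$$

where

$$B_1 = \begin{bmatrix} 0 & 0 & \ldots & 0 & 0 \\ a_{21} & 0 & \ldots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{n,n-1} & 0 \end{bmatrix}$$

and

$$C_1 = \begin{bmatrix} 0 & a_{12} & \ldots & a_{1,n-1} & a_{1n} \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \ldots & 0 & a_{n-1,n} \\ 0 & 0 & \ldots & 0 & 0 \end{bmatrix}$$

Then

$$D^{-1}(D - A) = B + C$$

where

$$B = -D^{-1} B_1 \quad \text{and} \quad C = -D^{-1} C_1$$

and the triangular matrices B and C effect the partition of the
matrix of system (8) that is necessary for application of the Seidel
process. By formula (3), the convergence of the Seidel process for
system (7) depends on the properties of the roots of the equation

$$\det[-D^{-1} C_1 - (E + D^{-1} B_1)\lambda] = 0 \qquad (9)$$

Equation (9) can be replaced by the equivalent equation

$$\det[(D + B_1)\lambda + C_1] = 0$$

or

$$\begin{vmatrix} a_{11}\lambda & a_{12} & \ldots & a_{1n} \\ a_{21}\lambda & a_{22}\lambda & \ldots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1}\lambda & a_{n2}\lambda & \ldots & a_{nn}\lambda \end{vmatrix} = 0 \qquad (10)$$

Thus, for the convergence of the Seidel process for system (7) given
an arbitrary constant term b and an arbitrary initial approximation
$x^{(0)}$, it is necessary and sufficient that all the roots $\lambda_j$
of equation (10) satisfy the conditions

$$|\lambda_j| < 1 \quad (j = 1, 2, \ldots, n)$$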
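
The criterion based on equation (10) can be tried out on a small system. The sketch below assumes NumPy; the matrix A and vector b are illustrative. It uses the observation that $(D + B_1)\lambda + C_1$ is singular exactly when $\lambda$ is an eigenvalue of $-(D + B_1)^{-1} C_1$.

```python
import numpy as np

def seidel_roots(a):
    """Roots of det[(D + B1)*lambda + C1] = 0 for the system A x = b."""
    d_plus_b1 = np.tril(a)        # D + B1: diagonal and strictly lower part of A
    c1 = np.triu(a, k=1)          # C1: strictly upper part of A
    return np.linalg.eigvals(-np.linalg.solve(d_plus_b1, c1))

def seidel(a, b, x0, sweeps):
    """Seidel process: each new component is used as soon as it is computed."""
    x = np.array(x0, dtype=float)
    for _ in range(sweeps):
        for i in range(len(b)):
            s = a[i, :] @ x - a[i, i] * x[i]   # sum over j != i with current values
            x[i] = (b[i] - s) / a[i, i]
    return x

# Illustrative system (an assumption); its exact solution is (1, 1, 1).
A = np.array([[4.0, 1.0, 1.0],
              [1.0, 5.0, 2.0],
              [1.0, 2.0, 6.0]])
b = np.array([6.0, 8.0, 9.0])

max_root = max(abs(seidel_roots(A)))
x_seidel = seidel(A, b, [0.0, 0.0, 0.0], 100)
```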


11.5 CONVERGENCE OF THE SEIDEL PROCESS
FOR A NORMAL SYSTEM


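Theorem. For a normal system, the ordinary Seidel process converges
for any choice of the initial vector.


Proof. Let the linear system

$$Ax = b \qquad (1)$$

be normal, that is, let the matrix $A = [a_{ij}]$ be symmetric and
positive definite.

Set

$$A = D + V + V^*$$

where

$$D = \begin{bmatrix} a_{11} & 0 & \ldots & 0 \\ 0 & a_{22} & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & a_{nn} \end{bmatrix}$$

is a diagonal matrix,

$$V = \begin{bmatrix} 0 & 0 & \ldots & 0 \\ a_{21} & 0 & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \ldots & 0 \end{bmatrix}$$

is a lower triangular matrix, and

$$V^* = \begin{bmatrix} 0 & a_{12} & \ldots & a_{1n} \\ 0 & 0 & \ldots & a_{2n} \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & 0 \end{bmatrix}$$

is an upper triangular matrix, which, because A is symmetric, is
the transpose of V. We then have

$$(D + V + V^*)\,x = b$$

whence

$$Dx = b - (V + V^*)\,x$$

and, consequently,

$$x = D^{-1} b - D^{-1}(V + V^*)\,x \qquad (2)$$

where

$$D^{-1} = \begin{bmatrix} \dfrac{1}{a_{11}} & & 0 \\ & \ddots & \\ 0 & & \dfrac{1}{a_{nn}} \end{bmatrix}$$

According to the foregoing, the Seidel process for system (1) or
the equivalent system (2) is constructed as follows:

$$x^{(k)} = D^{-1} b + B x^{(k)} + C x^{(k-1)} \quad (k = 1, 2, \ldots) \qquad (3)$$

where

$$B = -D^{-1} V \quad \text{and} \quad C = -D^{-1} V^*$$

By virtue of the theorem of the preceding section, for the process
to converge it is necessary and sufficient that all the eigenvalues $\lambda$
of the matrix

$$M = (E - B)^{-1} C$$

be less than unity in modulus.
In our case we have

$$M = -(E + D^{-1} V)^{-1} D^{-1} V^* = -[D^{-1}(D + V)]^{-1} D^{-1} V^* = -(D + V)^{-1} D D^{-1} V^* = -(D + V)^{-1} V^*$$

Let e be a unit eigenvector of matrix M corresponding to the
eigenvalue $\lambda$, that is,

$$(D + V)^{-1} V^* e = -\lambda e$$

or

$$V^* e = -\lambda (D + V)\,e$$

From this,

$$(V^* e, e) = -\lambda\,[(D + V)e, e]$$

and hence

$$\lambda = -\frac{(V^* e, e)}{(De, e) + (Ve, e)}$$

We introduce the notation

$$(De, e) = \sum_{j=1}^{n} a_{jj} |e_j|^2 = \sigma > 0$$

and

$$(Ve, e) = \alpha + i\beta$$

where $\alpha$ and $\beta$ are real and $i^2 = -1$.

Since the matrix $V^*$ is the transpose of V, we obtain

$$(V^* e, e) = (e, Ve) = \overline{(Ve, e)} = \alpha - i\beta$$

Therefore

$$\lambda = -\frac{\alpha - i\beta}{(\sigma + \alpha) + i\beta}$$

and hence

$$|\lambda| = \frac{\sqrt{\alpha^2 + \beta^2}}{\sqrt{(\sigma + \alpha)^2 + \beta^2}} \qquad (4)$$

Using the positive definiteness of the matrix A, we get

$$(Ae, e) = (De, e) + (Ve, e) + (V^* e, e) = \sigma + (\alpha + i\beta) + (\alpha - i\beta) = \sigma + 2\alpha > 0$$

that is,

$$\sigma + \alpha > -\alpha \qquad (5)$$

Furthermore, taking into account the positive nature of $\sigma$, we
clearly have

$$\sigma + \alpha > \alpha$$

Thus, the inequality

$$\sigma + \alpha > |\alpha| \qquad (6)$$

is always valid, whence for the terms of the fraction (4) we get

$$\sqrt{(\sigma + \alpha)^2 + \beta^2} > \sqrt{\alpha^2 + \beta^2} \ge 0$$

or

$$|\lambda| < 1$$

which is what we set out to prove.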
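
A small numerical experiment makes the theorem concrete. The sketch below assumes NumPy; the matrix is an illustrative symmetric positive definite matrix chosen so that the ordinary iteration process diverges while the Seidel process, as the theorem asserts, converges.

```python
import numpy as np

def seidel_matrix(a):
    """M = -(D + V)^{-1} V* for a symmetric matrix A = D + V + V*."""
    d_plus_v = np.tril(a)          # D + V
    v_star = np.triu(a, k=1)       # V*
    return -np.linalg.solve(d_plus_v, v_star)

# Illustrative normal system matrix (an assumption); eigenvalues 2.6, 0.2, 0.2.
A = np.array([[1.0, 0.8, 0.8],
              [0.8, 1.0, 0.8],
              [0.8, 0.8, 1.0]])

is_positive_definite = bool(np.all(np.linalg.eigvalsh(A) > 0))
rho_seidel = max(abs(np.linalg.eigvals(seidel_matrix(A))))

# Ordinary iteration uses alpha = D^{-1}(D - A) = E - D^{-1} A.
alpha = np.eye(3) - A / np.diag(A)[:, None]
rho_simple = max(abs(np.linalg.eigvals(alpha)))
```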
11.6 METHODS FOR EFFECTIVELY CHECKING 
THE CONDITIONS OF CONVERGENCE 


In order to verify the conditions of the theorems of convergence
of iteration processes, it is necessary to have effective criteria
that permit determining whether the roots $\lambda_1, \lambda_2, \ldots, \lambda_n$ of a
given algebraic polynomial

$$f(\lambda) = p_0 \lambda^n + p_1 \lambda^{n-1} + \ldots + p_n \qquad (1)$$

meet the requirement

$$|\lambda_j| < 1 \quad (j = 1, 2, \ldots, n) \qquad (2)$$

or do not. This problem is settled very simply by using the well-known
Hurwitz conditions [2].


Theorem (Hurwitz). Let the coefficients $p_k$ (k = 0, 1, ..., n) of
the polynomial (1) be real, and

$$p_0 > 0$$

and let

$$M = \begin{bmatrix} p_1 & p_0 & 0 & 0 & \ldots & 0 \\ p_3 & p_2 & p_1 & p_0 & \ldots & 0 \\ \vdots & \vdots & \vdots & \vdots & & \vdots \\ p_{2n-1} & p_{2n-2} & p_{2n-3} & p_{2n-4} & \ldots & p_n \end{bmatrix}$$

be a matrix of order n whose rows are extended sequences of the
coefficients of polynomial (1),

$$p_{2m-1},\ p_{2m-2},\ \ldots,\ p_{2m-n} \quad (m = 1, 2, \ldots, n)$$

where $p_k = 0$ for k < 0 and k > n. Then all roots $\lambda_1, \lambda_2, \ldots, \lambda_n$ of
(1) have negative real parts

$$\operatorname{Re} \lambda_j < 0 \quad (j = 1, 2, \ldots, n)$$

(which is to say they are located in the left half-plane of the complex
plane $\lambda = \alpha + i\beta$) if and only if the principal diagonal minors
of matrix M are positive; that is

$$\Delta_1 = p_1 > 0, \quad \Delta_2 = \begin{vmatrix} p_1 & p_0 \\ p_3 & p_2 \end{vmatrix} > 0, \quad \ldots, \quad \Delta_n = p_n \Delta_{n-1} > 0 \qquad (3)$$

Example 1. For the quadratic trinomial

$$p_0 \lambda^2 + p_1 \lambda + p_2$$

the Hurwitz conditions are

$$p_0 > 0, \quad p_1 > 0, \quad p_2 > 0$$


We would like to know when the roots of polynomial (1) satisfy
Condition (2), that is, when they lie in the complex plane $\lambda$ within
the unit circle

$$|\lambda| < 1$$

Using the linear-fractional function

$$\lambda = \frac{\mu + 1}{\mu - 1}$$

we can transform the interior of the unit circle $|\lambda| < 1$ into the
left half-plane $\operatorname{Re} \mu < 0$. Our polynomial (1) then becomes

$$f\!\left(\frac{\mu + 1}{\mu - 1}\right) = p_0 \left(\frac{\mu + 1}{\mu - 1}\right)^{n} + p_1 \left(\frac{\mu + 1}{\mu - 1}\right)^{n-1} + \ldots + p_n = \frac{1}{(\mu - 1)^n}\left[p_0 (\mu + 1)^n + p_1 (\mu + 1)^{n-1}(\mu - 1) + \ldots + p_n (\mu - 1)^n\right]$$

Hence, the roots of (1) are located inside the unit circle if and
only if the auxiliary polynomial

$$F(\mu) = \pm\left[p_0 (\mu + 1)^n + p_1 (\mu + 1)^{n-1}(\mu - 1) + \ldots + p_n (\mu - 1)^n\right]$$

satisfies the Hurwitz conditions (3), and the sign is chosen so that
the leading coefficient

$$\pm (p_0 + p_1 + \ldots + p_n) > 0$$
Example 2. Consider the quadratic trinomial

$$f(\lambda) = \lambda^2 + p\lambda + q \qquad (4)$$

where p and q are real. The auxiliary polynomial is of the form

$$F(\mu) = \pm\left[(\mu + 1)^2 + p(\mu + 1)(\mu - 1) + q(\mu - 1)^2\right] = \pm\left[(1 + p + q)\mu^2 + 2(1 - q)\mu + (1 - p + q)\right]$$

From the Hurwitz conditions we get

$$\pm(1 + p + q) > 0, \quad \pm(1 - q) > 0, \quad \pm(1 - p + q) > 0$$

Consider the cases:
(a) q < 1, then q > -p - 1 and q > p - 1;
(b) q > 1, then q < -p - 1 and q < p - 1, which is impossible.

Consequently, equation (4) has the roots $\lambda_1, \lambda_2$ which are less than
unity in modulus if and only if

$$|p| < 1 + q, \quad |q| < 1 \qquad (5)$$
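
Criterion (5) can be compared with a direct root computation. The following sketch assumes NumPy; the function names and the sample coefficient pairs are illustrative choices.

```python
import numpy as np

def roots_inside_unit_circle(p, q):
    """Criterion (5) for lambda^2 + p*lambda + q: |p| < 1 + q and |q| < 1."""
    return abs(p) < 1 + q and abs(q) < 1

def max_root_modulus(p, q):
    """Largest modulus among the roots of lambda^2 + p*lambda + q."""
    return max(abs(np.roots([1.0, p, q])))

# Illustrative coefficient pairs (assumptions), chosen away from the boundary.
samples = [(0.0, 0.5), (-1.5, 0.56), (-2.5, 1.0), (0.0, 1.2), (1.2, -0.3)]
agreement = [roots_inside_unit_circle(p, q) == (max_root_modulus(p, q) < 1)
             for p, q in samples]
```

For instance, the pair p = -1.5, q = 0.56 gives the roots 0.7 and 0.8, and the criterion is satisfied since 1.5 < 1.56.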


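Since for n = 2 the characteristic equation of the matrix $\alpha$ is
of the form

$$\begin{vmatrix} \alpha_{11} - \lambda & \alpha_{12} \\ \alpha_{21} & \alpha_{22} - \lambda \end{vmatrix} = 0$$

or

$$\lambda^2 - (\alpha_{11} + \alpha_{22})\lambda + \det \alpha = 0$$

then for the convergence of the appropriate iteration process for
a system of two equations it is necessary that

$$|\det \alpha| < 1$$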




The regions of convergence of the process of ordinary iteration 
and the process of Seidel overlap, generally speaking. There are 
cases of linear systems for which the ordinary iteration process 
converges while the Seidel process diverges, and conversely [3].


Example 3. Consider the linear system

$$x = \alpha x + \beta \qquad (6)$$

with the matrix

$$\alpha = \begin{bmatrix} p & q \\ -q & p \end{bmatrix}$$

where p and q are real.
The characteristic equation of matrix $\alpha$ is of the form

$$\begin{vmatrix} p - \lambda & q \\ -q & p - \lambda \end{vmatrix} = 0$$

or

$$(\lambda - p)^2 + q^2 = 0$$

whence

$$\lambda_{1,2} = p \pm iq$$

For the convergence of the process of ordinary iteration it is necessary
and sufficient that

$$|\lambda_{1,2}| = \sqrt{p^2 + q^2} < 1$$

that is

$$p^2 + q^2 < 1$$

(region A in Fig. 57).

Fig. 57

For the Seidel method, the equation defining convergence is of
the form

$$\begin{vmatrix} p - \lambda & q \\ -q\lambda & p - \lambda \end{vmatrix} = 0$$

or

$$\lambda^2 - (2p - q^2)\lambda + p^2 = 0 \qquad (7)$$

On the basis of the results of Example 2, for the roots $\lambda_1$ and $\lambda_2$
of equation (7) to satisfy the conditions

$$|\lambda_1| < 1, \quad |\lambda_2| < 1$$

it is necessary and sufficient that the following inequalities be
valid:

$$|2p - q^2| < 1 + p^2, \quad p^2 < 1$$

whence

$$|p| < 1, \quad |q| < 1 + p$$

(region B in Fig. 57). Since the regions A and B partially overlap,
it follows that for system (6) one can choose the coefficients
p and q, firstly, so that the process of iteration converges and the
Seidel process diverges (for example, p = -0.5, q = 0.6), and,
secondly, so that, conversely, the Seidel process converges and the
iteration process diverges (for example, p = 0.5, q = 1).
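
The two parameter choices quoted in Example 3 can be checked directly. This is a minimal sketch, assuming NumPy; the function name is illustrative. It compares the largest eigenvalue modulus of $\alpha$ (ordinary iteration) with that of $(E - B)^{-1} C$ (Seidel process, Sec. 11.4).

```python
import numpy as np

def spectral_radii(p, q):
    """Return (rho_iteration, rho_seidel) for alpha = [[p, q], [-q, p]]."""
    alpha = np.array([[p, q], [-q, p]], dtype=float)
    b = np.tril(alpha, k=-1)                      # strictly lower part B
    c = np.triu(alpha)                            # upper part C with the diagonal
    seidel = np.linalg.solve(np.eye(2) - b, c)    # (E - B)^{-1} C
    rho_iter = max(abs(np.linalg.eigvals(alpha)))
    rho_seidel = max(abs(np.linalg.eigvals(seidel)))
    return rho_iter, rho_seidel

first = spectral_radii(-0.5, 0.6)    # iteration converges, Seidel diverges
second = spectral_radii(0.5, 1.0)    # Seidel converges, iteration diverges
```

For p = 0.5, q = 1 equation (7) reduces to $\lambda^2 + 0.25 = 0$, so both Seidel roots have modulus 0.5.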


REFERENCES FOR CHAPTER 11 


[1] V. I. Smirnov, Course of Higher Mathematics, 1933, Chapter VII
(in Russian).

[2] A. G. Kurosh, Course of Higher Algebra, 1972, Chapter 8 (translated from
the Russian).

[3] V. N. Faddeyeva, Computational Methods of Linear Algebra, 1950, Chapter II
(in Russian).


Chapter 12 


FINDING THE EIGENVALUES 
AND EIGENVECTORS OF A MATRIX 


12.1 INTRODUCTORY REMARKS 


In solving theoretical and practical problems, one often finds it
necessary to determine the eigenvalues of a given matrix A, that
is, to compute the roots of its characteristic (secular) equation

$$\det(A - \lambda E) = 0 \qquad (1)$$

and also to find the corresponding eigenvectors of A. The second
problem is simpler because if the roots of the characteristic equation
are known, then finding the eigenvectors reduces to finding
nontrivial solutions of certain homogeneous linear systems. We
therefore begin with the first problem: to compute the roots of the
characteristic equation (1).

Here, two techniques are chiefly used: (1) expanding the secular
determinant into a polynomial of degree n:

$$D(\lambda) = \det(A - \lambda E)$$

and then solving the equation $D(\lambda) = 0$ by one of the familiar
approximate methods (like, say, the Lobachevsky-Graeffe method
described in Secs. 5.7 to 5.12) and (2) approximating the roots of
the characteristic equation (mostly the numerically largest ones)
by the method of iteration without any preliminary expansion of
the secular determinant.

In this chapter we discuss the principal methods of solving the
given general problem and begin with the expansion of secular
determinants.


12.2 EXPANSION OF SECULAR DETERMINANTS 

As is well known, the secular determinant of a matrix $A = [a_{ij}]$
is a determinant of the form

$$D(\lambda) = \det(A - \lambda E) = \begin{vmatrix} a_{11} - \lambda & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} - \lambda & \ldots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} - \lambda \end{vmatrix} \qquad (1)$$

Equating this determinant to zero we get the characteristic equation

$$D(\lambda) = 0$$

If it is required to find all the roots of the characteristic equation,
then it is desirable, first of all, to compute determinant (1).
Expanding determinant (1), we get an nth-degree polynomial:

$$D(\lambda) = (-1)^n\left[\lambda^n - \sigma_1 \lambda^{n-1} + \sigma_2 \lambda^{n-2} - \ldots + (-1)^n \sigma_n\right] \qquad (2)$$

where

$$\sigma_1 = \sum_{\alpha=1}^{n} a_{\alpha\alpha}$$

is the sum of all first-order diagonal minors of the matrix A,

$$\sigma_2 = \sum_{\alpha < \beta} \begin{vmatrix} a_{\alpha\alpha} & a_{\alpha\beta} \\ a_{\beta\alpha} & a_{\beta\beta} \end{vmatrix}$$

is the sum of all second-order diagonal minors of A,

$$\sigma_3 = \sum_{\alpha < \beta < \gamma} \begin{vmatrix} a_{\alpha\alpha} & a_{\alpha\beta} & a_{\alpha\gamma} \\ a_{\beta\alpha} & a_{\beta\beta} & a_{\beta\gamma} \\ a_{\gamma\alpha} & a_{\gamma\beta} & a_{\gamma\gamma} \end{vmatrix}$$

is the sum of all third-order diagonal minors of matrix A, and so
forth. Finally,

$$\sigma_n = \det A$$

It is easy to see that the number of kth-order diagonal minors
of A is

$$C_n^k = \frac{n(n-1)\ldots(n-k+1)}{k!} \quad (k = 1, 2, \ldots, n)$$

From this we find that direct computation of the coefficients of
the characteristic polynomial (2) is equivalent to computing

$$C_n^1 + C_n^2 + \ldots + C_n^n = 2^n - 1$$

determinants of various orders, which, generally speaking, is a
problem that is hard to handle when the values of n are large.
This has given rise to special methods for expanding secular
determinants (the methods of A. N. Krylov, A. M. Danilevsky,
Leverrier, the method of undetermined coefficients, the method of
interpolation, and others) (see [1]). Some of these methods will
be examined in the following sections.
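
The direct expansion by diagonal minors can be sketched in a few lines, which also shows why it does not scale: the loop visits $2^n - 1$ minors. The code assumes NumPy; the function names are illustrative, and the test matrix is the one used in the examples of Secs. 12.3 and 12.6.

```python
import numpy as np
from itertools import combinations
from math import comb

def sigma(a, k):
    """Sum of all k-th order diagonal (principal) minors of A."""
    n = a.shape[0]
    return sum(np.linalg.det(a[np.ix_(idx, idx)])
               for idx in combinations(range(n), k))

def secular_coefficients(a):
    """Coefficients of lambda^n - s1*lambda^(n-1) + s2*lambda^(n-2) - ..."""
    n = a.shape[0]
    return [1.0] + [(-1) ** k * sigma(a, k) for k in range(1, n + 1)]

A = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0, 2.0],
              [4.0, 3.0, 2.0, 1.0]])

coefficients = secular_coefficients(A)
minor_count = sum(comb(4, k) for k in range(1, 5))   # number of determinants used
```

For this matrix the bracket in formula (2) becomes $\lambda^4 - 4\lambda^3 - 40\lambda^2 - 56\lambda - 20$, in agreement with the result of the Danilevsky and Krylov examples.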




12.3 THE METHOD OF DANILEVSKY 


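The essence of the Danilevsky method [1] consists in reducing
the secular determinant to the so-called Frobenius standard form:

$$D(\lambda) = \begin{vmatrix} p_1 - \lambda & p_2 & p_3 & \ldots & p_n \\ 1 & -\lambda & 0 & \ldots & 0 \\ 0 & 1 & -\lambda & \ldots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \ldots & -\lambda \end{vmatrix} \qquad (1)$$

If we succeed in writing the secular determinant in the form (1),
then, after expanding it in terms of the elements of the first row,
we get

$$D(\lambda) = (p_1 - \lambda)(-\lambda)^{n-1} - p_2 (-\lambda)^{n-2} + p_3 (-\lambda)^{n-3} - \ldots + (-1)^{n-1} p_n$$

or

$$D(\lambda) = (-1)^n \left(\lambda^n - p_1 \lambda^{n-1} - p_2 \lambda^{n-2} - \ldots - p_n\right) \qquad (2)$$

Thus, expanding a secular determinant written in the normal
form (1) does not present any difficulties. Denote the given matrix
by

$$A = \begin{bmatrix} a_{11} & a_{12} & \ldots & a_{1n} \\ a_{21} & a_{22} & \ldots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{nn} \end{bmatrix}$$

and the Frobenius matrix similar to it by

$$P = \begin{bmatrix} p_1 & p_2 & \ldots & p_{n-1} & p_n \\ 1 & 0 & \ldots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \ldots & 1 & 0 \end{bmatrix}$$

Thus

$$P = S^{-1} A S$$

where S is a nonsingular matrix.
Since similar matrices have the same characteristic polynomials,
we have

$$\det(A - \lambda E) = \det(P - \lambda E) \qquad (3)$$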


Thus, to substantiate this method, it will suffice to show how
matrix P is constructed by proceeding from matrix A. According
to the Danilevsky method, the transition from matrix A to the
similar matrix P is effected by means of n - 1 similarity
transformations which successively transform the rows of A, beginning
with the last, into corresponding rows of P.
Let us illustrate the beginning of the process. Our purpose is to
carry the row

$$a_{n1}\ \ a_{n2}\ \ \ldots\ \ a_{n,n-1}\ \ a_{nn}$$

into the row 0 0 ... 1 0. Assuming that $a_{n,n-1} \ne 0$, divide all
elements of the (n-1)th column of A by $a_{n,n-1}$. Then its nth row
will take the form

$$a_{n1}\ \ a_{n2}\ \ \ldots\ \ 1\ \ a_{nn}$$

Then subtract from all the other columns the (n-1)th column
of the transformed matrix multiplied by the numbers $a_{n1}, a_{n2}, \ldots, a_{nn}$,
respectively.

We thus obtain a matrix whose last row is of the desired form:
0 0 ... 1 0. The foregoing operations are elementary transformations
performed on the columns of matrix A. Performing these
same transformations on the unit matrix, we get the matrix

$$M_{n-1} = \begin{bmatrix} 1 & 0 & \ldots & 0 & 0 \\ 0 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ m_{n-1,1} & m_{n-1,2} & \ldots & m_{n-1,n-1} & m_{n-1,n} \\ 0 & 0 & \ldots & 0 & 1 \end{bmatrix}$$

where

$$m_{n-1,i} = -\frac{a_{ni}}{a_{n,n-1}} \quad \text{for } i \ne n-1 \qquad (4)$$

and

$$m_{n-1,n-1} = \frac{1}{a_{n,n-1}} \qquad (4')$$

From this we conclude (see Sec. 7.14) that these operations are
equivalent to postmultiplying matrix A by matrix $M_{n-1}$; that is,
the foregoing transformations result in the matrix

$$B = A M_{n-1} = \begin{bmatrix} b_{11} & b_{12} & \ldots & b_{1,n-1} & b_{1n} \\ \vdots & \vdots & & \vdots & \vdots \\ b_{n-1,1} & b_{n-1,2} & \ldots & b_{n-1,n-1} & b_{n-1,n} \\ 0 & 0 & \ldots & 1 & 0 \end{bmatrix} \qquad (5)$$

Using the rule of matrix multiplication, we find that the elements
of B are computed from the following formulas:

$$b_{ij} = a_{ij} + a_{i,n-1}\, m_{n-1,j} \quad \text{for } 1 \le i \le n,\ j \ne n-1 \qquad (6)$$

$$b_{i,n-1} = a_{i,n-1}\, m_{n-1,n-1} \quad \text{for } 1 \le i \le n \qquad (6')$$

However, the matrix $B = A M_{n-1}$ will not be similar to A. To
have a similarity transformation, it is necessary to premultiply
the matrix B by the inverse matrix $M_{n-1}^{-1}$:

$$M_{n-1}^{-1} A M_{n-1} = M_{n-1}^{-1} B$$

It is readily seen by direct verification that the inverse matrix
$M_{n-1}^{-1}$ is of the form

$$M_{n-1}^{-1} = \begin{bmatrix} 1 & 0 & \ldots & 0 & 0 \\ 0 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ a_{n1} & a_{n2} & \ldots & a_{n,n-1} & a_{nn} \\ 0 & 0 & \ldots & 0 & 1 \end{bmatrix} \qquad (7)$$

Let

$$M_{n-1}^{-1} A M_{n-1} = C$$

Then

$$C = M_{n-1}^{-1} B \qquad (8)$$

Since it is quite obvious that premultiplication of matrix B by
matrix $M_{n-1}^{-1}$ does not alter the transformed row of B, matrix C
is of the form

$$C = \begin{bmatrix} c_{11} & c_{12} & \ldots & c_{1,n-1} & c_{1n} \\ c_{21} & c_{22} & \ldots & c_{2,n-1} & c_{2n} \\ \vdots & \vdots & & \vdots & \vdots \\ c_{n-1,1} & c_{n-1,2} & \ldots & c_{n-1,n-1} & c_{n-1,n} \\ 0 & 0 & \ldots & 1 & 0 \end{bmatrix} \qquad (9)$$

Multiplying matrices $M_{n-1}^{-1}$ (7) and B (5) together, we get

$$c_{ij} = b_{ij} \quad \text{for } 1 \le i \le n-2 \qquad (10)$$

and

$$c_{n-1,j} = \sum_{k=1}^{n} a_{nk} b_{kj} \quad \text{for } 1 \le j \le n \qquad (10')$$

Thus, multiplication of B by $M_{n-1}^{-1}$ only changes the (n-1)th
row of B. The elements of this row are found from formulas (10)
and (10'). The resulting matrix C is similar to A and has one
reduced row. This concludes the first stage of the process.

Now, if $c_{n-1,n-2} \ne 0$, then similar operations are performed on
matrix C by taking its (n-2)th row as the principal one. We
then obtain the matrix

$$D = M_{n-2}^{-1} C M_{n-2}$$

with two reduced rows. This matrix is subjected to the same
operations, and the process is continued until we finally obtain the
Frobenius matrix

$$P = M_1^{-1} M_2^{-1} \ldots M_{n-1}^{-1}\, A\, M_{n-1} \ldots M_2 M_1$$

if, of course, all the n - 1 intermediate transformations are possible.
The entire process can be arranged in a convenient computational
scheme, the formation of which is illustrated by the following
example.


Example. Reduce the following matrix to the Frobenius form:

$$A = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 2 & 3 \\ 3 & 2 & 1 & 2 \\ 4 & 3 & 2 & 1 \end{bmatrix}$$

Solution. The computations are arranged in Table 25.

TABLE 25
COMPUTATIONAL SCHEME OF DANILEVSKY'S METHOD

[The entries of Table 25 are not legible in this copy.]

Enter the elements $a_{ij}$ (i, j = 1, 2, 3, 4) of the given matrix
and the check sums $a_{i5} = \sum_{j=1}^{4} a_{ij}$ (i = 1, 2, 3, 4) (column $\Sigma$) in rows 1 to
4 of Table 25. We mark element $a_{43} = 2$ in the third column
(marked column). In row I we enter the elements of the third row
of the matrix $M_{n-1} = M_3$ computed from formulas (4) and (4'):

$$m_{31} = -\frac{a_{41}}{a_{43}} = -\frac{4}{2} = -2,$$
$$m_{32} = -\frac{a_{42}}{a_{43}} = -\frac{3}{2} = -1.5,$$
$$m_{33} = \frac{1}{a_{43}} = \frac{1}{2} = 0.5,$$
$$m_{34} = -\frac{a_{44}}{a_{43}} = -\frac{1}{2} = -0.5$$

Here also (row I of Table 25) we enter the element

$$m_{35} = -\frac{a_{45}}{a_{43}} = -\frac{10}{2} = -5$$

which is obtained by a similar device from the check column.
The number -5 must coincide with the sum of the elements of
row I that do not enter the check column (after element $m_{33}$ is
replaced by -1). For the sake of convenience, we write the number
-1 alongside the element $m_{33}$ and separate it from the latter by
a dash.
In rows 5 to 8 of column $M^{-1}$ we enter the third row of the
matrix $M_3^{-1}$, which, by virtue of formula (7), coincides with the
fourth row of the original matrix A. In the appropriate columns
of rows 5 to 8 we enter the elements of the matrix

$$B = A M_3$$

which are computed from the two-term formulas (6) for nonmarked
columns and by the one-term formula (6') for the marked column.
Thus, for the first column we have

$$b_{11} = 1 + 3(-2) = -5,$$
$$b_{21} = 2 + 2(-2) = -2,$$
$$b_{31} = 3 + 1(-2) = 1,$$
$$b_{41} = 4 + 2(-2) = 0$$

and so forth.
The transformed elements of the third (marked) column are
obtained by multiplying the initial elements by $m_{33} = 0.5$. For
example,

$$b_{13} = 3 \cdot 0.5 = 1.5$$

Note that the last row of matrix B must have the form

0 0 1 0

For a check, we augment the matrix B with the transformations,
by analogous two-term formulas with $m_{35} = -5$, of the corresponding
elements of the column marked $\Sigma$. For instance,

$$b'_{15} = 10 + 3 \cdot (-5) = -5,$$
$$b'_{25} = 8 + 2 \cdot (-5) = -2,$$
$$b'_{35} = 8 + 1 \cdot (-5) = 3,$$
$$b'_{45} = 10 + 2 \cdot (-5) = 0$$

The results obtained are entered in the $\Sigma'$ column in appropriate
rows. Adding to them the elements of the third column, we have
the check sums

$$b_{i5} = \sum_{j=1}^{4} b_{ij} \quad (i = 1, 2, 3, 4)$$

for the rows 5-8 (column $\Sigma$).

The transformation $M_3^{-1}$ which was performed on the matrix B
and which yielded the matrix $C = M_3^{-1} B$ only alters the third row
of B, that is, the seventh row of the table. The elements of this
transformed row 7' are obtained from formula (10'); they are thus
the sums of the paired products of the elements of column $M^{-1}$
located in rows 5 to 8 by the corresponding elements of each
of the columns of matrix B. For example,

$$c_{31} = 4 \cdot (-5) + 3 \cdot (-2) + 2 \cdot 1 + 1 \cdot 0 = -24$$

and so on.
We perform the same transformations on the $\Sigma$ column:

$$c_{35} = 4(-3.5) + 3(-1) + 2 \cdot 3.5 + 1 \cdot 1 = -9$$

This way we get a matrix C consisting of rows 5, 6, 7', 8 with
check sums $\Sigma$; C is similar to A and has one reduced row 8.
This completes the construction of the first similarity transformation
$C = M_3^{-1} A M_3$.

Now, taking matrix C as the initial matrix and isolating element
$c_{32} = -15$ (second column), we continue the process as before.
We thus get the matrix $D = M_2^{-1} C M_2$, the elements of which are
located in rows 9, 10', 11, 12 and which contains two reduced
rows. Finally, starting from element $d_{21} = 6$ (first column) and
transforming the matrix D into a similar matrix, we get the desired
Frobenius matrix P, whose elements are entered in the rows 13',
14, 15, and 16. Checking at each stage in the process is effected
via columns $\Sigma$ and $\Sigma'$.

Thus, the Frobenius matrix is

$$P = \begin{bmatrix} 4 & 40 & 56 & 20 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$

whence the secular determinant reduced to the Frobenius standard
form will be written as

$$D(\lambda) = \begin{vmatrix} 4 - \lambda & 40 & 56 & 20 \\ 1 & -\lambda & 0 & 0 \\ 0 & 1 & -\lambda & 0 \\ 0 & 0 & 1 & -\lambda \end{vmatrix}$$

or

$$D(\lambda) = \lambda^4 - 4\lambda^3 - 40\lambda^2 - 56\lambda - 20$$
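
The regular case of the reduction can be sketched compactly in matrix form. The code below assumes NumPy and forms each $M_k$ and $M_k^{-1}$ explicitly instead of following the tabular scheme with check columns; the function name and the pivot tolerance are illustrative choices. It stops with an error when a pivot vanishes, which is the situation treated separately as an exceptional case.

```python
import numpy as np

def danilevsky(a, tol=1e-12):
    """Reduce A to the Frobenius form P by n-1 similarity transformations."""
    a = np.array(a, dtype=float)
    n = a.shape[0]
    for k in range(n - 1, 0, -1):          # row being reduced (0-based index)
        pivot = a[k, k - 1]
        if abs(pivot) < tol:
            raise ValueError("zero pivot: exceptional case of the method")
        m = np.eye(n)
        m[k - 1, :] = -a[k, :] / pivot     # formula (4)
        m[k - 1, k - 1] = 1.0 / pivot      # formula (4')
        m_inv = np.eye(n)
        m_inv[k - 1, :] = a[k, :]          # formula (7)
        a = m_inv @ a @ m                  # similarity transformation
    return a

A = [[1, 2, 3, 4],
     [2, 1, 2, 3],
     [3, 2, 1, 2],
     [4, 3, 2, 1]]

P = danilevsky(A)      # first row holds p_1, ..., p_n
```

For the matrix of the example the first row of P comes out as 4, 40, 56, 20 and the remaining rows form the shifted unit matrix, as in the text.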


12.4 EXCEPTIONAL CASES IN THE DANILEVSKY METHOD 


The Danilevsky process proceeds without any complications if
the chosen elements are nonzero. We will now examine exceptional
cases when this requirement is not met.

Suppose that in transforming matrix A into the Frobenius
matrix P we arrived, after a few steps, at a matrix of the form

$$D = \begin{bmatrix} d_{11} & d_{12} & \ldots & d_{1,k-1} & d_{1k} & \ldots & d_{1,n-1} & d_{1n} \\ d_{21} & d_{22} & \ldots & d_{2,k-1} & d_{2k} & \ldots & d_{2,n-1} & d_{2n} \\ \vdots & \vdots & & \vdots & \vdots & & \vdots & \vdots \\ d_{k1} & d_{k2} & \ldots & d_{k,k-1} & d_{kk} & \ldots & d_{k,n-1} & d_{kn} \\ 0 & 0 & \ldots & 0 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \ldots & 0 & 0 & \ldots & 1 & 0 \end{bmatrix}$$

and it was found that $d_{k,k-1} = 0$.

It is then impossible to continue the transformation by the
Danilevsky method. Two cases are possible here.

1. Let some element of matrix D to the left of the zero element
$d_{k,k-1}$ be different from zero, that is, $d_{kl} \ne 0$, where
$l < k-1$. We then move this element to the position of the
zero element $d_{k,k-1}$; that is, we interchange the (k-1)th and
lth columns of D and simultaneously interchange the (k-1)th
and lth rows. It can be proved that the resulting new matrix
D' will be similar to the earlier one. We then apply the
Danilevsky method to this new matrix.

2. Suppose $d_{kl} = 0$ (l = 1, 2, ..., k-1); then D is of
the form

$$D = \begin{bmatrix} d_{11} & \ldots & d_{1,k-1} & d_{1k} & \ldots & d_{1,n-1} & d_{1n} \\ \vdots & & \vdots & \vdots & & \vdots & \vdots \\ d_{k-1,1} & \ldots & d_{k-1,k-1} & d_{k-1,k} & \ldots & d_{k-1,n-1} & d_{k-1,n} \\ 0 & \ldots & 0 & d_{kk} & \ldots & d_{k,n-1} & d_{kn} \\ 0 & \ldots & 0 & 1 & \ldots & 0 & 0 \\ \vdots & & \vdots & \vdots & & \vdots & \vdots \\ 0 & \ldots & 0 & 0 & \ldots & 1 & 0 \end{bmatrix} = \begin{bmatrix} D_1 & L \\ 0 & D_2 \end{bmatrix}$$

where $D_1$ is the square block formed by the first k-1 rows and columns
and $D_2$ is the square block formed by the last n-k+1 rows and columns.
In this case the secular determinant $\det(D - \lambda E)$ breaks up
into two determinants:

$$\det(D - \lambda E) = \det(D_1 - \lambda E)\,\det(D_2 - \lambda E)$$

Here, the matrix $D_2$ is already reduced to the canonical form of
Frobenius and so $\det(D_2 - \lambda E)$ is computed at once. It remains
to apply the Danilevsky method to the matrix $D_1$.


12.5 COMPUTATION OF EIGENVECTORS 
BY THE DANILEVSKY METHOD 


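The method of Danilevsky permits finding the eigenvectors of
a given matrix A if the eigenvalues are known. Let $\lambda$ be an
eigenvalue of matrix A and, hence, an eigenvalue of the Frobenius
matrix P which is similar to it.

We find the eigenvector $y = (y_1, y_2, \ldots, y_n)$ of P corresponding
to the given eigenvalue $\lambda$: $Py = \lambda y$, whence $(P - \lambda E)\,y = 0$ or

$$\begin{bmatrix} p_1 - \lambda & p_2 & p_3 & \ldots & p_n \\ 1 & -\lambda & 0 & \ldots & 0 \\ 0 & 1 & -\lambda & \ldots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \ldots & -\lambda \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} = 0$$

Multiplying the matrices together, we get a system for determining
the coordinates $y_1, y_2, \ldots, y_n$ of the eigenvector y:

$$\begin{aligned} (p_1 - \lambda)\,y_1 + p_2 y_2 + \ldots + p_n y_n &= 0, \\ y_1 - \lambda y_2 &= 0, \\ y_2 - \lambda y_3 &= 0, \\ \ldots \\ y_{n-1} - \lambda y_n &= 0 \end{aligned} \qquad (1)$$

System (1) is homogeneous. Its solution, to within the proportionality
factor, can be found as follows. Put $y_n = 1$. We then
successively obtain

$$y_{n-1} = \lambda, \quad y_{n-2} = \lambda^2, \quad \ldots, \quad y_1 = \lambda^{n-1}$$

Thus, the desired eigenvector is

$$y = \begin{bmatrix} \lambda^{n-1} \\ \lambda^{n-2} \\ \vdots \\ \lambda \\ 1 \end{bmatrix}$$

Now denote by x the eigenvector of matrix A corresponding
to the eigenvalue $\lambda$. It is clear then that we have

$$x = M_{n-1} M_{n-2} \ldots M_2 M_1\, y$$

The transformation $M_1$ performed on y yields

$$M_1 y = \begin{bmatrix} m_{11} & m_{12} & \ldots & m_{1n} \\ 0 & 1 & \ldots & 0 \\ \vdots & \vdots & & \vdots \\ 0 & 0 & \ldots & 1 \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} \sum_{k=1}^{n} m_{1k} y_k \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$$

Thus, the transformation $M_1$ only alters the first coordinate of
the vector. Similarly, the transformation $M_2$ alters only the second
coordinate of the vector $M_1 y$, etc. Repeating this process
n - 1 times, we obtain the desired eigenvector x of matrix A.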
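
The recovery of an eigenvector of A from the Frobenius form can be sketched as follows. The code assumes NumPy, handles only the regular case (nonzero pivots), and accumulates the product $S = M_{n-1} M_{n-2} \ldots M_1$ while reducing the matrix; the function names are illustrative, and the eigenvalue is taken from `np.linalg.eigvalsh` purely for the sake of the demonstration.

```python
import numpy as np

def danilevsky_transform(a):
    """Return S = M_{n-1} M_{n-2} ... M_1 for the regular Danilevsky process."""
    a = np.array(a, dtype=float)
    n = a.shape[0]
    s = np.eye(n)
    for k in range(n - 1, 0, -1):
        pivot = a[k, k - 1]
        if abs(pivot) < 1e-12:
            raise ValueError("zero pivot: exceptional case of the method")
        m = np.eye(n)
        m[k - 1, :] = -a[k, :] / pivot
        m[k - 1, k - 1] = 1.0 / pivot
        m_inv = np.eye(n)
        m_inv[k - 1, :] = a[k, :]
        a = m_inv @ a @ m
        s = s @ m                          # append the next factor on the right
    return s

def eigenvector(a, lam):
    """x = S y with y = (lam^(n-1), ..., lam, 1)."""
    s = danilevsky_transform(a)
    n = s.shape[0]
    y = lam ** np.arange(n - 1, -1, -1)    # eigenvector of the Frobenius matrix
    return s @ y

A = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0, 2.0],
              [4.0, 3.0, 2.0, 1.0]])

lam = max(np.linalg.eigvalsh(A))           # one eigenvalue of the symmetric matrix A
x = eigenvector(A, lam)
residual = np.linalg.norm(A @ x - lam * x) / np.linalg.norm(x)
```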
12.6 THE METHOD OF KRYLOV 


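We will now examine a method of expanding a secular determinant
that is due to A. N. Krylov [1] and is based on an essentially
different principle than the method of Danilevsky.

Let

$$D(\lambda) = \det(\lambda E - A) = \lambda^n + p_1 \lambda^{n-1} + \ldots + p_n \qquad (1)$$

be the characteristic polynomial (apart from sign) of the matrix A.
By the Cayley-Hamilton theorem (Sec. 11.2), matrix A
reduces its characteristic polynomial to zero, and so

$$A^n + p_1 A^{n-1} + \ldots + p_n E = 0 \qquad (2)$$

Now let us take an arbitrary nonzero vector

$$y^{(0)} = \begin{bmatrix} y_1^{(0)} \\ \vdots \\ y_n^{(0)} \end{bmatrix}$$

Postmultiplying both members of (2) by $y^{(0)}$, we get

$$A^n y^{(0)} + p_1 A^{n-1} y^{(0)} + \ldots + p_n y^{(0)} = 0 \qquad (3)$$

Set

$$A^k y^{(0)} = y^{(k)} \quad (k = 1, 2, \ldots, n) \qquad (4)$$

Then equation (3) takes the form

$$y^{(n)} + p_1 y^{(n-1)} + \ldots + p_n y^{(0)} = 0 \qquad (5)$$

or

$$\begin{bmatrix} y_1^{(n-1)} & y_1^{(n-2)} & \ldots & y_1^{(0)} \\ y_2^{(n-1)} & y_2^{(n-2)} & \ldots & y_2^{(0)} \\ \vdots & \vdots & & \vdots \\ y_n^{(n-1)} & y_n^{(n-2)} & \ldots & y_n^{(0)} \end{bmatrix} \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_n \end{bmatrix} = -\begin{bmatrix} y_1^{(n)} \\ y_2^{(n)} \\ \vdots \\ y_n^{(n)} \end{bmatrix} \qquad (5')$$

where

$$y^{(k)} = \begin{bmatrix} y_1^{(k)} \\ y_2^{(k)} \\ \vdots \\ y_n^{(k)} \end{bmatrix} \quad (k = 0, 1, 2, \ldots, n)$$

Hence, the vector equation (5) is equivalent to the system of
equations

$$p_1 y_j^{(n-1)} + p_2 y_j^{(n-2)} + \ldots + p_n y_j^{(0)} = -y_j^{(n)} \quad (j = 1, 2, \ldots, n) \qquad (6)$$

from which we can, generally speaking, determine the unknown
coefficients $p_1, p_2, \ldots, p_n$.
Since on the basis of formula (4)

$$y^{(k)} = A y^{(k-1)}$$

(k = 1, 2, ..., n), the coordinates $y_1^{(k)}, y_2^{(k)}, \ldots, y_n^{(k)}$ of the vector
$y^{(k)}$ are successively computed from the formulas

$$y_i^{(1)} = \sum_{j=1}^{n} a_{ij} y_j^{(0)}, \quad \ldots, \quad y_i^{(n)} = \sum_{j=1}^{n} a_{ij} y_j^{(n-1)} \quad (i = 1, 2, \ldots, n) \qquad (7)$$

Thus, determining the coefficients $p_j$ of the characteristic polynomial
(1) by the Krylov method reduces to solving the linear
system of equations (6), whose coefficients are computed from the
formulas (7); note that the coordinates of the initial vector

$$y^{(0)} = \begin{bmatrix} y_1^{(0)} \\ \vdots \\ y_n^{(0)} \end{bmatrix}$$

are arbitrary. If system (6) has a unique solution, then its roots
$p_1, p_2, \ldots, p_n$ are the coefficients of the characteristic polynomial
(1). This solution can be found, for example, by the Gaussian
method (Sec. 8.3). If system (6) does not have a unique
solution, the problem is complicated [1]. In this case it is advisable
to change the initial vector.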

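Example. Use Krylov's method to find the characteristic polynomial of the matrix (see Sec. 12.3)

$$A = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 2 & 3 \\ 3 & 2 & 1 & 2 \\ 4 & 3 & 2 & 1 \end{bmatrix}$$

Solution. We choose the initial vector

$$y^{(0)} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

Using formulas (7), we determine the coordinates of the vectors

$$y^{(k)} = A^k y^{(0)} \quad (k = 1, 2, 3, 4)$$

Thus, we have

$$y^{(1)} = A y^{(0)} = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 2 & 3 \\ 3 & 2 & 1 & 2 \\ 4 & 3 & 2 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix},$$

$$y^{(2)} = A y^{(1)} = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 2 & 3 \\ 3 & 2 & 1 & 2 \\ 4 & 3 & 2 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 30 \\ 22 \\ 18 \\ 20 \end{bmatrix},$$

$$y^{(3)} = A y^{(2)} = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 2 & 3 \\ 3 & 2 & 1 & 2 \\ 4 & 3 & 2 & 1 \end{bmatrix} \begin{bmatrix} 30 \\ 22 \\ 18 \\ 20 \end{bmatrix} = \begin{bmatrix} 208 \\ 178 \\ 192 \\ 242 \end{bmatrix},$$

$$y^{(4)} = A y^{(3)} = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 2 & 3 \\ 3 & 2 & 1 & 2 \\ 4 & 3 & 2 & 1 \end{bmatrix} \begin{bmatrix} 208 \\ 178 \\ 192 \\ 242 \end{bmatrix} = \begin{bmatrix} 2108 \\ 1704 \\ 1656 \\ 1992 \end{bmatrix}$$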




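We set up the system (6):

$$\begin{bmatrix} y_1^{(3)} & y_1^{(2)} & y_1^{(1)} & y_1^{(0)} \\ y_2^{(3)} & y_2^{(2)} & y_2^{(1)} & y_2^{(0)} \\ y_3^{(3)} & y_3^{(2)} & y_3^{(1)} & y_3^{(0)} \\ y_4^{(3)} & y_4^{(2)} & y_4^{(1)} & y_4^{(0)} \end{bmatrix} \begin{bmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \end{bmatrix} = -\begin{bmatrix} y_1^{(4)} \\ y_2^{(4)} \\ y_3^{(4)} \\ y_4^{(4)} \end{bmatrix}$$

which in our case has the form

$$\begin{bmatrix} 208 & 30 & 1 & 1 \\ 178 & 22 & 2 & 0 \\ 192 & 18 & 3 & 0 \\ 242 & 20 & 4 & 0 \end{bmatrix} \begin{bmatrix} p_1 \\ p_2 \\ p_3 \\ p_4 \end{bmatrix} = -\begin{bmatrix} 2108 \\ 1704 \\ 1656 \\ 1992 \end{bmatrix}$$

whence

$$\begin{aligned} 208 p_1 + 30 p_2 + p_3 + p_4 &= -2108, \\ 178 p_1 + 22 p_2 + 2 p_3 &= -1704, \\ 192 p_1 + 18 p_2 + 3 p_3 &= -1656, \\ 242 p_1 + 20 p_2 + 4 p_3 &= -1992 \end{aligned}$$

Solving this system, we obtain

$$p_1 = -4, \quad p_2 = -40, \quad p_3 = -56, \quad p_4 = -20$$

Hence

$$\det(\lambda E - A) = \lambda^4 - 4\lambda^3 - 40\lambda^2 - 56\lambda - 20$$

which coincides with the result obtained by the Danilevsky method (see Sec. 12.3).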


12.7 COMPUTATION OF EIGENVECTORS
BY THE KRYLOV METHOD


The method of A. N. Krylov makes it possible, in simple fa- 
shion, to find the corresponding eigenvectors [1]. 

For the sake of simplicity, we confine ourselves to the case when 
the characteristic polynomial 


$$D(\lambda) = \lambda^n + p_1 \lambda^{n-1} + \dots + p_n \tag{1}$$

has distinct roots $\lambda_1, \lambda_2, \dots, \lambda_n$. Let us assume that the coefficients of the polynomial (1) and its roots have been determined. It is required to find the eigenvectors $x^{(1)}, x^{(2)}, \dots, x^{(n)}$ corresponding to the eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$, respectively.

Let $y^{(0)}, y^{(1)} = A y^{(0)}, \dots, y^{(n-1)} = A^{n-1} y^{(0)}$ be the vectors used in Krylov's method for finding the coefficients $p_i$ $(i = 1, 2, \dots, n)$. Decomposing the vector $y^{(0)}$ into the eigenvectors $x^{(i)}$ $(i = 1, 2, \dots, n)$


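we have

$$y^{(0)} = c_1 x^{(1)} + c_2 x^{(2)} + \dots + c_n x^{(n)} \tag{2}$$

where $c_i$ $(i = 1, 2, \dots, n)$ are certain numerical coefficients. From this, taking into consideration that

$$A x^{(i)} = \lambda_i x^{(i)}, \quad A^k x^{(i)} = \lambda_i^k x^{(i)} \quad (i = 1, 2, \dots, n)$$

we get

$$\begin{aligned} y^{(1)} &= c_1 \lambda_1 x^{(1)} + c_2 \lambda_2 x^{(2)} + \dots + c_n \lambda_n x^{(n)}, \\ &\;\;\vdots \\ y^{(n-1)} &= c_1 \lambda_1^{n-1} x^{(1)} + c_2 \lambda_2^{n-1} x^{(2)} + \dots + c_n \lambda_n^{n-1} x^{(n)} \end{aligned} \tag{3}$$

Let

$$\varphi_i(\lambda) = \lambda^{n-1} + q_{1i} \lambda^{n-2} + \dots + q_{n-1,i} \tag{4}$$

$(i = 1, 2, \dots, n)$ be an arbitrary set of polynomials. Forming a linear combination of the vectors $y^{(n-1)}, y^{(n-2)}, \dots, y^{(0)}$ with the coefficients $1, q_{1i}, \dots, q_{n-1,i}$, we find, by relations (2) and (3),

$$y^{(n-1)} + q_{1i} y^{(n-2)} + \dots + q_{n-1,i} y^{(0)} = c_1 \varphi_i(\lambda_1) x^{(1)} + c_2 \varphi_i(\lambda_2) x^{(2)} + \dots + c_n \varphi_i(\lambda_n) x^{(n)} \tag{5}$$

If we put

$$\varphi_i(\lambda) = \frac{D(\lambda)}{\lambda - \lambda_i} \quad (i = 1, 2, \dots, n) \tag{6}$$

then obviously

$$\varphi_i(\lambda_j) = 0 \quad \text{for } i \ne j$$

and

$$\varphi_i(\lambda_i) = D'(\lambda_i) \ne 0$$

Formula (5) then becomes

$$c_i \varphi_i(\lambda_i) x^{(i)} = y^{(n-1)} + q_{1i} y^{(n-2)} + \dots + q_{n-1,i} y^{(0)} \quad (i = 1, 2, \dots, n) \tag{7}$$

Thus, if $c_i \ne 0$, then the resulting linear combination of vectors $y^{(n-1)}, y^{(n-2)}, \dots, y^{(0)}$ yields the eigenvector $x^{(i)}$ to within a numerical factor. The coefficients $q_{ji}$ $(j = 1, 2, \dots, n-1)$ can easily be found from the Horner scheme

$$q_{0i} = 1, \quad q_{ji} = \lambda_i q_{j-1,i} + p_j$$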
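
Formula (7) together with the Horner scheme can be illustrated by the following Python sketch. It assumes NumPy; the function name `krylov_eigenvectors` is hypothetical, and the use of `np.roots` to obtain the roots of $D(\lambda)$ is an assumption made for brevity (the text supposes the roots are already known).

```python
import numpy as np

def krylov_eigenvectors(A, y0):
    """Eigenvalues and (unnormalized) eigenvectors by formula (7)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    # Krylov vectors y^(0), ..., y^(n)
    ys = [np.asarray(y0, dtype=float)]
    for _ in range(n):
        ys.append(A @ ys[-1])
    # Coefficients p_1..p_n from system (6) of Sec. 12.6
    M = np.column_stack(ys[n - 1::-1])
    p = np.linalg.solve(M, -ys[n])
    # Roots of D(lambda); np.roots is a convenience assumed here
    lams = np.roots(np.concatenate(([1.0], p)))
    vectors = []
    for lam in lams:
        q = 1.0                                   # q_{0i} = 1
        x = np.zeros(n, dtype=lams.dtype)
        for j in range(n):
            x = x + q * ys[n - 1 - j]             # term q_{ji} * y^(n-1-j)
            q = lam * q + p[j]                    # Horner step for q_{j+1,i}
        vectors.append(x)
    return lams, vectors
```

Each returned vector equals $c_i \varphi_i(\lambda_i) x^{(i)}$, so it is an eigenvector only up to a numerical factor, and it vanishes if the corresponding $c_i$ is zero.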




12.8 LEVERRIER’S METHOD 


This method of expanding a secular determinant is based on the Newtonian formulas [3] for the sums of powers of the roots of an algebraic equation.

Let

$$\det(\lambda E - A) = \lambda^n + p_1 \lambda^{n-1} + \dots + p_n \tag{1}$$

be the characteristic polynomial of a given matrix $A = [a_{ij}]$ and let $\lambda_1, \lambda_2, \dots, \lambda_n$ be the complete set of its roots, where each root is repeated as many times as its multiplicity.

Set

$$s_k = \lambda_1^k + \lambda_2^k + \dots + \lambda_n^k \quad (k = 0, 1, 2, \dots, n)$$

Then for $k \le n$ the Newtonian formulas [3] hold true:

$$s_k + p_1 s_{k-1} + \dots + p_{k-1} s_1 = -k p_k \quad (k = 1, 2, \dots, n) \tag{2}$$

whence

$$\begin{aligned} p_1 &= -s_1, \\ p_2 &= -\frac{1}{2}(s_2 + p_1 s_1), \\ &\;\;\vdots \\ p_n &= -\frac{1}{n}(s_n + p_1 s_{n-1} + \dots + p_{n-1} s_1) \end{aligned} \tag{3}$$

If the sums $s_1, s_2, \dots, s_n$ are known, then by means of formulas (3) we can, step by step, determine the coefficients $p_1, p_2, \dots, p_n$ of the characteristic polynomial (1).

The sums $s_1, s_2, \dots, s_n$ are computed as follows: for $s_1$ we have (Sec. 10.12)

$$s_1 = \lambda_1 + \lambda_2 + \dots + \lambda_n = \operatorname{tr} A$$

that is,

$$s_1 = \sum_{i=1}^{n} a_{ii} \tag{4}$$

Then, as we know (Sec. 11.1), $\lambda_1^k, \lambda_2^k, \dots, \lambda_n^k$ are the eigenvalues of the matrix $A^k$. Therefore

$$s_k = \lambda_1^k + \lambda_2^k + \dots + \lambda_n^k = \operatorname{tr} A^k$$

That is, if

$$A^k = [a_{ij}^{(k)}]$$

then

$$s_k = \sum_{i=1}^{n} a_{ii}^{(k)} \tag{5}$$

The powers $A^k = A^{k-1} A$ are found by direct multiplication.


Thus, the scheme for expanding a secular determinant by the Leverrier method is extremely simple, namely: first compute $A^k$ $(k = 1, 2, \dots, n)$, the powers of the given matrix $A$, then find the corresponding $s_k$, which are the sums of the elements of the principal diagonals of the matrices $A^k$ and, finally, determine from formulas (3) the desired coefficients $p_i$ $(i = 1, 2, \dots, n)$.

The Leverrier method is extremely laborious because one has to 
compute high powers of the given matrix. Its merit lies in the 
simple computational scheme and the absence of exceptional cases. 
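
The scheme just described is short enough to be sketched directly. The following Python fragment is an illustrative sketch only, assuming NumPy; the name `leverrier_char_poly` is hypothetical.

```python
import numpy as np

def leverrier_char_poly(A):
    """Return [p_1, ..., p_n] of det(lambda*E - A) by Leverrier's method."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    # s[k-1] = tr(A^k), k = 1..n, with A^k = A^(k-1) A
    s = []
    Ak = np.eye(n)
    for _ in range(n):
        Ak = Ak @ A
        s.append(np.trace(Ak))
    # Newton formulas (3): p_k = -(s_k + p_1 s_(k-1) + ... + p_(k-1) s_1) / k
    p = []
    for k in range(1, n + 1):
        acc = s[k - 1] + sum(p[i] * s[k - 2 - i] for i in range(k - 1))
        p.append(-acc / k)
    return p
```

For simplicity the sketch forms every power $A^k$ completely, although only the diagonal of the last power is actually needed.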


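Example. Use the Leverrier method to expand the characteristic determinant of the matrix (see Sec. 12.3)

$$A = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 2 & 3 \\ 3 & 2 & 1 & 2 \\ 4 & 3 & 2 & 1 \end{bmatrix}$$

Solution. Form the powers $A^k$ $(k = 2, 3, 4)$ of the matrix $A$. We have

$$A^2 = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 2 & 3 \\ 3 & 2 & 1 & 2 \\ 4 & 3 & 2 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 2 & 3 \\ 3 & 2 & 1 & 2 \\ 4 & 3 & 2 & 1 \end{bmatrix} = \begin{bmatrix} 30 & 22 & 18 & 20 \\ 22 & 18 & 16 & 18 \\ 18 & 16 & 18 & 22 \\ 20 & 18 & 22 & 30 \end{bmatrix},$$

$$A^3 = \begin{bmatrix} 30 & 22 & 18 & 20 \\ 22 & 18 & 16 & 18 \\ 18 & 16 & 18 & 22 \\ 20 & 18 & 22 & 30 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 2 & 3 \\ 3 & 2 & 1 & 2 \\ 4 & 3 & 2 & 1 \end{bmatrix} = \begin{bmatrix} 208 & 178 & 192 & 242 \\ 178 & 148 & 154 & 192 \\ 192 & 154 & 148 & 178 \\ 242 & 192 & 178 & 208 \end{bmatrix},$$

$$A^4 = \begin{bmatrix} 208 & 178 & 192 & 242 \\ 178 & 148 & 154 & 192 \\ 192 & 154 & 148 & 178 \\ 242 & 192 & 178 & 208 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 2 & 3 \\ 3 & 2 & 1 & 2 \\ 4 & 3 & 2 & 1 \end{bmatrix} = \begin{bmatrix} 2108 & 1704 & 1656 & 1992 \\ 1704 & 1388 & 1368 & 1656 \\ 1656 & 1368 & 1388 & 1704 \\ 1992 & 1656 & 1704 & 2108 \end{bmatrix}$$

Note that it was not necessary to compute $A^4$ completely, it being sufficient to find only the elements of the principal diagonal of the matrix.

Whence

$$\begin{aligned} s_1 &= \operatorname{tr} A = 1 + 1 + 1 + 1 = 4, \\ s_2 &= \operatorname{tr} A^2 = 30 + 18 + 18 + 30 = 96, \\ s_3 &= \operatorname{tr} A^3 = 208 + 148 + 148 + 208 = 712, \\ s_4 &= \operatorname{tr} A^4 = 2108 + 1388 + 1388 + 2108 = 6992 \end{aligned}$$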


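Hence, from formulas (3), we get

$$\begin{aligned} p_1 &= -s_1 = -4, \\ p_2 &= -\frac{1}{2}(s_2 + p_1 s_1) = -\frac{1}{2}(96 - 4 \cdot 4) = -40, \\ p_3 &= -\frac{1}{3}(s_3 + p_1 s_2 + p_2 s_1) = -\frac{1}{3}(712 - 4 \cdot 96 - 40 \cdot 4) = -56, \\ p_4 &= -\frac{1}{4}(s_4 + p_1 s_3 + p_2 s_2 + p_3 s_1) = -\frac{1}{4}(6992 - 4 \cdot 712 - 40 \cdot 96 - 56 \cdot 4) = -20 \end{aligned}$$

Thus, we obtain the already familiar result (see Sec. 12.3):

$$\begin{vmatrix} \lambda - 1 & -2 & -3 & -4 \\ -2 & \lambda - 1 & -2 & -3 \\ -3 & -2 & \lambda - 1 & -2 \\ -4 & -3 & -2 & \lambda - 1 \end{vmatrix} = \lambda^4 - 4\lambda^3 - 40\lambda^2 - 56\lambda - 20$$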


12.9 ON THE METHOD OF UNDETERMINED COEFFICIENTS 


A secular determinant can also be expanded by finding a suffi- 
ciently large number of its numerical values. 
Let

$$D(\lambda) = \lambda^n + p_1 \lambda^{n-1} + \dots + p_n \tag{1}$$

be the secular determinant of the matrix $A$, that is,

$$D(\lambda) = \det(\lambda E - A)$$

If in (1) we successively put $\lambda = 0, 1, 2, \dots, n-1$, then for the coefficients $p_i$ $(i = 1, 2, \dots, n)$ we get the following system of linear equations

$$\begin{aligned} p_n &= D(0), \\ 1^n + p_1 \cdot 1^{n-1} + \dots + p_n &= D(1), \\ 2^n + p_1 \cdot 2^{n-1} + \dots + p_n &= D(2), \\ &\;\;\vdots \\ (n-1)^n + p_1 (n-1)^{n-1} + \dots + p_n &= D(n-1) \end{aligned} \tag{2}$$

whence

$$\begin{aligned} p_1 + p_2 + \dots + p_{n-1} &= D(1) - D(0) - 1, \\ 2^{n-1} p_1 + 2^{n-2} p_2 + \dots + 2 p_{n-1} &= D(2) - D(0) - 2^n, \\ &\;\;\vdots \\ (n-1)^{n-1} p_1 + (n-1)^{n-2} p_2 + \dots + (n-1) p_{n-1} &= D(n-1) - D(0) - (n-1)^n \end{aligned} \tag{3}$$

and

$$p_n = D(0) = \det(-A)$$


From system (3), together with $p_n = D(0)$, we can determine the coefficients $p_i$ $(i = 1, 2, \dots, n)$ of the characteristic polynomial (1).

Introducing the matrix

$$C_n = \begin{bmatrix} 1 & 1 & \dots & 1 \\ 2^{n-1} & 2^{n-2} & \dots & 2 \\ \vdots & \vdots & & \vdots \\ (n-1)^{n-1} & (n-1)^{n-2} & \dots & n-1 \end{bmatrix}$$

and the vectors

$$D = \begin{bmatrix} D(1) - D(0) - 1 \\ D(2) - D(0) - 2^n \\ \vdots \\ D(n-1) - D(0) - (n-1)^n \end{bmatrix}, \quad P = \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_{n-1} \end{bmatrix}$$

we can write system (3) in the form of a matrix equation:

$$C_n P = D \tag{4}$$

whence

$$P = C_n^{-1} D \tag{5}$$

Note that the inverse matrix $C_n^{-1}$ depends only on the order $n$ of the secular determinant and can be found beforehand if one has to expand large numbers of secular determinants of the same order.

Thus, the use of this method reduces to computing the numerical determinants

$$D(k) = \det(kE - A) \quad (k = 0, 1, 2, \dots, n-1)$$

and to finding the solution of the standard linear system (4).
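
As an illustration of relations (3) to (5), the following Python sketch evaluates the determinants $D(k)$ numerically and solves system (4). It assumes NumPy; the function name `char_poly_by_values` is hypothetical, and `np.linalg.det` stands in for whatever method is used to compute the numerical determinants.

```python
import numpy as np

def char_poly_by_values(A):
    """Return [p_1, ..., p_n] from the values D(0), D(1), ..., D(n-1)."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    E = np.eye(n)
    # D(k) = det(kE - A), k = 0, 1, ..., n-1
    D = [np.linalg.det(k * E - A) for k in range(n)]
    p_n = D[0]
    if n == 1:
        return [p_n]
    # Matrix C_n: row k holds k^(n-1), k^(n-2), ..., k  (k = 1..n-1)
    C = np.array([[float(k) ** (n - j) for j in range(1, n)]
                  for k in range(1, n)])
    # Right-hand sides D(k) - D(0) - k^n of system (3)
    rhs = np.array([D[k] - D[0] - float(k) ** n for k in range(1, n)])
    P = np.linalg.solve(C, rhs)
    return list(P) + [p_n]
```

Since $C_n$ depends only on $n$, a caller expanding many determinants of the same order could compute its inverse once and reuse it, as the text points out.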


12.10 A COMPARISON OF DIFFERENT METHODS 
OF EXPANDING A SECULAR DETERMINANT


An indication of the relative effectiveness of various methods 
of expanding a secular determinant is given in Table 26 [4], which 
states the number of operations required by each method depending 
on the order of the determinant. 

From this table it is seen that the best method for expanding 
secular determinants of order higher than fifth is, from the view- 
point of number of operations, the Danilevsky method. 




12.11 FINDING THE NUMERICALLY LARGEST EIGENVALUE 
OF A MATRIX AND THE CORRESPONDING EIGENVECTOR 


Suppose we have the characteristic equation

$$\det(A - \lambda E) = 0$$

The roots of this equation $\lambda_1, \lambda_2, \dots, \lambda_n$ are the eigenvalues of the matrix $A$. Suppose that to them correspond the linearly independent eigenvectors $x^{(1)}, x^{(2)}, \dots, x^{(n)}$. We will now give certain iteration methods for computing the numerically largest eigenvalue of matrix $A$ that do not require expanding its secular determinant.

Case 1. Among the eigenvalues of matrix $A$ there is one which is largest in modulus. For the sake of definiteness, let us assume that

$$|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \dots \ge |\lambda_n| \tag{1}$$

so that the numerically largest eigenvalue is the first one. It is obvious that for a real matrix the numerically largest eigenvalue $\lambda_1$ is real. We note that such is the case if the matrix $A$ is real and its elements are positive (Sec. 10.16, Perron's theorem).


TABLE 26

THE NUMBER OF OPERATIONS USED BY VARIOUS METHODS
IN EXPANDING A SECULAR DETERMINANT DEPENDING ON ITS ORDER

(M-D: multiplications and divisions; A-S: additions and subtractions)

                                n = 3        n = 4        n = 5           n = 7               n = 9
Method                        M-D   A-S    M-D   A-S    M-D   A-S     M-D      A-S       M-D       A-S
Direct expansion               12    10     60    46    320   238   13 692   10 078   986 400   725 758
Danilevsky                     14    12     42    36     92    80      282      252       632       576
Krylov                         67    38    179   118    389   280    1 287    1 022     3 209     2 688
Leverrier                      41    27    153   114    414   330    1 791    1 533     5 228     4 644
Undetermined coefficients      67    41    171   116    364   265    1 189      945     2 966     2 481
Interpolation formula          46    38    125   102    279   230      972      826     2 525     2 202
(see Sec. 14.23)


We now give an approximate method for computing the root $\lambda_1$. Take an arbitrary vector $y$ and decompose it into the eigenvectors $x^{(j)}$ of $A$:

$$y = \sum_{j=1}^{n} c_j x^{(j)}$$

where $c_j$ $(j = 1, 2, \dots, n)$ are constant coefficients. Operating on vector $y$ by the matrix $A$ we have

$$A y = \sum_{j=1}^{n} c_j A x^{(j)}$$

whence, since $x^{(j)}$ is an eigenvector of the transformation $A$, that is, $A x^{(j)} = \lambda_j x^{(j)}$, we get

$$A y = \sum_{j=1}^{n} c_j \lambda_j x^{(j)}$$

We call $Ay$ an iteration of the vector $y$.

Forming the iterations $Ay, A^2 y, \dots, A^m y$ in succession, we find

$$A^m y = \sum_{j=1}^{n} c_j \lambda_j^m x^{(j)} \tag{2}$$

(mth iteration).

In the space $E_n = \{y\}$ choose a basis $e_1, e_2, \dots, e_n$ (not necessarily the unit basis). Let

$$A^m y = y^{(m)} \quad (m = 1, 2, 3, \dots)$$

and

$$y^{(m)} = \begin{bmatrix} y_1^{(m)} \\ \vdots \\ y_n^{(m)} \end{bmatrix}$$

where $y_i^{(m)}$ $(i = 1, 2, \dots, n)$ are the coordinates of the vector $y^{(m)}$ in the chosen basis.

Decomposing the eigenvectors $x^{(j)}$ into the vectors of the basis, we have

$$x^{(j)} = \sum_{i=1}^{n} x_{ij} e_i \tag{3}$$

whence, substituting (3) into (2), we obtain

$$y^{(m)} = \sum_{j=1}^{n} c_j \lambda_j^m \sum_{i=1}^{n} x_{ij} e_i$$

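or, changing the order of summation,

$$y^{(m)} = \sum_{i=1}^{n} e_i \sum_{j=1}^{n} c_j x_{ij} \lambda_j^m \tag{4}$$

The coefficient of $e_i$ is the $i$th coordinate of the vector $y^{(m)}$. We can thus write

$$y_i^{(m)} = \sum_{j=1}^{n} c_j x_{ij} \lambda_j^m \tag{4'}$$

Similarly

$$y_i^{(m+1)} = \sum_{j=1}^{n} c_j x_{ij} \lambda_j^{m+1}$$

Dividing the second sum by the first, we get

$$\frac{y_i^{(m+1)}}{y_i^{(m)}} = \frac{c_1 x_{i1} \lambda_1^{m+1} + \dots + c_n x_{in} \lambda_n^{m+1}}{c_1 x_{i1} \lambda_1^m + \dots + c_n x_{in} \lambda_n^m} \tag{5}$$

Suppose that $c_1 \ne 0$ and $x_{i1} \ne 0$. This can be achieved by appropriate choice of the initial vector $y$ and the basis $(e_1, e_2, \dots, e_n)$. Transform expression (5) as follows:

$$\frac{y_i^{(m+1)}}{y_i^{(m)}} = \lambda_1 \, \frac{1 + \dfrac{c_2 x_{i2}}{c_1 x_{i1}} \left(\dfrac{\lambda_2}{\lambda_1}\right)^{m+1} + \dots + \dfrac{c_n x_{in}}{c_1 x_{i1}} \left(\dfrac{\lambda_n}{\lambda_1}\right)^{m+1}}{1 + \dfrac{c_2 x_{i2}}{c_1 x_{i1}} \left(\dfrac{\lambda_2}{\lambda_1}\right)^{m} + \dots + \dfrac{c_n x_{in}}{c_1 x_{i1}} \left(\dfrac{\lambda_n}{\lambda_1}\right)^{m}}$$

From this, passing to the limit as $m \to \infty$ and taking into account inequality (1), we obtain

$$\lim_{m \to \infty} \frac{y_i^{(m+1)}}{y_i^{(m)}} = \lambda_1 \tag{6}$$

and, more exactly,

$$\frac{y_i^{(m+1)}}{y_i^{(m)}} = \lambda_1 + O\left(\left(\frac{\lambda_2}{\lambda_1}\right)^m\right)$$

(since $\lim\limits_{m \to \infty} \left(\dfrac{\lambda_j}{\lambda_1}\right)^m = 0$ for $j > 1$) or, approximately,

$$\lambda_1 \approx \frac{y_i^{(m+1)}}{y_i^{(m)}} \quad (i = 1, 2, \dots, n) \tag{7}$$

By taking a sufficiently large iteration number $m$, we can determine from (7) the largest (in modulus) root $\lambda_1$ of the characteristic equation of the given matrix $A$ to any degree of accuracy. To find this root one can use any coordinate of the vector $y^{(m)}$; in particular, one can take the arithmetic mean of the corresponding ratios.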


Note 1. In exceptional cases when the initial vector $y$ is not aptly chosen, formula (6) may not yield the required root or may even be meaningless, which is to say the limit of the ratio $y_i^{(m+1)} / y_i^{(m)}$ may not exist. This is readily seen from the "oscillating" values of the ratio. In such cases one should try a different initial vector.


Note 2. To accelerate the convergence of the iteration process (6) it is sometimes advantageous to form the sequence of matrices

$$\begin{aligned} A^2 &= A \cdot A, \\ A^4 &= A^2 \cdot A^2, \\ A^8 &= A^4 \cdot A^4, \\ &\;\;\vdots \\ A^{2^k} &= A^{2^{k-1}} \cdot A^{2^{k-1}} \end{aligned}$$

from which we find

$$y^{(m)} = A^m y$$

and

$$y^{(m+1)} = A y^{(m)}$$

where $m = 2^k$. Then we assume, as usual,

$$\lambda_1 \approx \frac{y_i^{(m+1)}}{y_i^{(m)}} \quad (i = 1, 2, \dots, n)$$

The vector $y^{(m)} = A^m y$ is approximately the eigenvector of the matrix $A$ corresponding to the eigenvalue $\lambda_1$.

Indeed, from formula (2) we have

$$A^m y = c_1 \lambda_1^m x^{(1)} + \sum_{j=2}^{n} c_j \lambda_j^m x^{(j)}$$

where $x^{(j)}$ $(j = 1, 2, \dots, n)$ are the eigenvectors of matrix $A$.

From this,

$$A^m y = c_1 \lambda_1^m \left( x^{(1)} + \sum_{j=2}^{n} \frac{c_j}{c_1} \left(\frac{\lambda_j}{\lambda_1}\right)^m x^{(j)} \right)$$

Since $\left(\dfrac{\lambda_j}{\lambda_1}\right)^m \to 0$ as $m \to \infty$ $(j > 1)$, for a sufficiently large $m$ we will have, to any degree of accuracy,

$$A^m y \approx c_1 \lambda_1^m x^{(1)}$$

Thus, $A^m y$ differs from the eigenvector $x^{(1)}$ solely by a numerical factor and, consequently, is also an eigenvector corresponding to the same eigenvalue $\lambda_1$.
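
The iteration process for Case 1 can be sketched as follows. This is a minimal sketch assuming NumPy; the name `power_method` is hypothetical, and the rescaling of the iterate at every step is an addition made here only to keep the numbers bounded (it does not change the ratios in formula (7)).

```python
import numpy as np

def power_method(A, y, iterations=50):
    """Estimate lambda_1 by formula (7) and return it with a unit eigenvector.

    Assumes no coordinate of the iterates is zero (see Note 1).
    """
    A = np.asarray(A, dtype=float)
    y = np.asarray(y, dtype=float)
    for _ in range(iterations):
        y_next = A @ y
        ratios = y_next / y                      # y_i^(m+1) / y_i^(m)
        y = y_next / np.linalg.norm(y_next)      # rescale (assumption, see above)
    # Arithmetic mean of the coordinate ratios, as suggested in the text
    return ratios.mean(), y
```

Widely differing or oscillating entries of `ratios` indicate the exceptional situation of Note 1, in which a different initial vector should be tried.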


Example. Find the largest eigenvalue of the matrix

$$A = \begin{bmatrix} 4 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 1 \end{bmatrix} \tag{8}$$

and the eigenvector corresponding to it.

Solution. Choose the initial vector

$$y = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$$

Form Table 27.


TABLE 27
COMPUTING THE FIRST EIGENVALUE

  y    Ay   A^2 y   A^3 y   A^4 y   A^5 y   A^6 y   A^7 y    A^8 y    A^9 y    A^10 y
  1     5      24     111     504    2268   10161   45433   202833   905238   4038939
  1     4      15      60     252    1089    4779   21141    93906   417987   1862460
  1     2       6      21      81     333    1422    6201    27342   121248    539235





Stopping with the iterations $A^9 y = y^{(9)}$ and $A^{10} y = y^{(10)}$, we get the values

$$\frac{y_1^{(10)}}{y_1^{(9)}} = \frac{4038939}{905238} = 4.462,$$

$$\frac{y_2^{(10)}}{y_2^{(9)}} = \frac{1862460}{417987} = 4.456,$$

$$\frac{y_3^{(10)}}{y_3^{(9)}} = \frac{539235}{121248} = 4.447$$

Hence, we can take, approximately,

$$\lambda_1 = \frac{1}{3}(4.462 + 4.456 + 4.447) = 4.455 \approx 4.46$$

For the first eigenvector of matrix $A$ we can take

$$A^{10} y = \begin{bmatrix} 4038939 \\ 1862460 \\ 539235 \end{bmatrix}$$

Normalizing it, we finally obtain

$$x^{(1)} = \begin{bmatrix} 0.90 \\ 0.42 \\ 0.12 \end{bmatrix}$$


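Case 2. The largest, in modulus, eigenvalue of matrix $A$ is multiple.

Let

$$\lambda_1 = \lambda_2 = \dots = \lambda_s$$

and

$$|\lambda_1| > |\lambda_k| \quad \text{for } k > s$$

From formula (5) we have

$$\frac{y_i^{(m+1)}}{y_i^{(m)}} = \frac{(c_1 x_{i1} + \dots + c_s x_{is}) \lambda_1^{m+1} + c_{s+1} x_{i,s+1} \lambda_{s+1}^{m+1} + \dots + c_n x_{in} \lambda_n^{m+1}}{(c_1 x_{i1} + \dots + c_s x_{is}) \lambda_1^{m} + c_{s+1} x_{i,s+1} \lambda_{s+1}^{m} + \dots + c_n x_{in} \lambda_n^{m}}$$

$$= \lambda_1 \, \frac{c_1 x_{i1} + \dots + c_s x_{is} + c_{s+1} x_{i,s+1} \left(\dfrac{\lambda_{s+1}}{\lambda_1}\right)^{m+1} + \dots + c_n x_{in} \left(\dfrac{\lambda_n}{\lambda_1}\right)^{m+1}}{c_1 x_{i1} + \dots + c_s x_{is} + c_{s+1} x_{i,s+1} \left(\dfrac{\lambda_{s+1}}{\lambda_1}\right)^{m} + \dots + c_n x_{in} \left(\dfrac{\lambda_n}{\lambda_1}\right)^{m}}$$

whence, if $c_1 x_{i1} + \dots + c_s x_{is} \ne 0$ and taking into account that

$$\left(\frac{\lambda_k}{\lambda_1}\right)^m \to 0 \quad \text{as } m \to \infty \text{ and } k > s$$

we get

$$\lim_{m \to \infty} \frac{y_i^{(m+1)}}{y_i^{(m)}} = \lambda_1 \quad (i = 1, 2, \dots, n)$$

or, more exactly,

$$\frac{y_i^{(m+1)}}{y_i^{(m)}} = \lambda_1 + O\left(\left(\frac{\lambda_{s+1}}{\lambda_1}\right)^m\right)$$

Thus, the foregoing method for computing $\lambda_1$ is applicable in this case as well.

As before,

$$y^{(m)} = A^m y$$

is one of the approximate eigenvectors of matrix $A$ corresponding to the eigenvalue $\lambda_1$. Generally speaking, by changing the initial vector $y$, we obtain a different linearly independent eigenvector of matrix $A$. Note that in this case there is no guarantee that our technique will determine the entire set of linearly independent eigenvectors of $A$ for the eigenvalue $\lambda_1$.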

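For Cases 1 and 2 we can offer a faster iteration process for finding the numerically largest eigenvalue $\lambda_1$ of matrix $A$, namely: form the sequence of matrices

$$A, \; A^2, \; A^4, \; A^8, \; \dots, \; A^{2^k}$$

As we know (see Sec. 10.12),

$$\sum_{i=1}^{n} \lambda_i = \operatorname{tr} A$$

Similarly,

$$\sum_{i=1}^{n} \lambda_i^m = \operatorname{tr} A^m$$

where $m = 2^k$. Confining ourselves to Case 1 for the sake of simplicity, we have

$$\lambda_1^m + \lambda_2^m + \dots + \lambda_n^m = \lambda_1^m \left[ 1 + \left(\frac{\lambda_2}{\lambda_1}\right)^m + \dots + \left(\frac{\lambda_n}{\lambda_1}\right)^m \right] = \operatorname{tr} A^m$$

whence

$$|\lambda_1| \sqrt[m]{1 + \left(\frac{\lambda_2}{\lambda_1}\right)^m + \dots + \left(\frac{\lambda_n}{\lambda_1}\right)^m} = \sqrt[m]{\operatorname{tr} A^m}$$

As $m \to \infty$ we get

$$|\lambda_1| = \lim_{m \to \infty} \sqrt[m]{\operatorname{tr} A^m}$$

or

$$|\lambda_1| \approx \sqrt[m]{\operatorname{tr} A^m}$$

where $m$ is sufficiently great $(m \gg n)$.

In order to avoid extraction of high-degree roots, we can find

$$A^{m+1} = A^m A$$

Then

$$\lambda_1^m + \lambda_2^m + \dots + \lambda_n^m = \operatorname{tr} A^m$$

and

$$\lambda_1^{m+1} + \lambda_2^{m+1} + \dots + \lambda_n^{m+1} = \operatorname{tr} A^{m+1}$$

whence, taking into account the smallness of $|\lambda_2|, \dots, |\lambda_n|$ as compared with $|\lambda_1|$, we obtain

$$\lambda_1 \approx \frac{\operatorname{tr} A^{m+1}}{\operatorname{tr} A^m}$$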
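
The trace-based process can be sketched in Python as follows. The sketch assumes NumPy; the name `largest_eigenvalue_by_traces` and the default number of squarings are choices made here for illustration.

```python
import numpy as np

def largest_eigenvalue_by_traces(A, k=5):
    """Approximate lambda_1 as tr A^(m+1) / tr A^m with m = 2^k."""
    A = np.asarray(A, dtype=float)
    Am = A.copy()
    for _ in range(k):
        Am = Am @ Am          # A^2, A^4, ..., A^(2^k) by repeated squaring
    # A^(m+1) = A^m A; the ratio of traces avoids high-degree roots
    return np.trace(Am @ A) / np.trace(Am)
```

Because the entries of $A^{2^k}$ grow like $\lambda_1^{2^k}$, the number of squarings `k` has to stay small enough to avoid floating-point overflow.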


12.12 THE METHOD OF SCALAR PRODUCTS FOR FINDING
THE FIRST EIGENVALUE OF A REAL MATRIX
A somewhat different iteration process (and at times a more advantageous one) can be given for finding the first eigenvalue $\lambda_1$ of a real matrix $A$. This method is based on the formation of scalar products

$$(A^k y_0, A'^k y_0) \quad \text{and} \quad (A^{k-1} y_0, A'^k y_0)$$

$(k = 1, 2, \dots)$, where $A'$ is the transpose of $A$ and $y_0$ is the initial vector chosen in some manner.

Let us now take up the method itself.

Suppose $A$ is a real matrix and $\lambda_1, \lambda_2, \dots, \lambda_n$ are its eigenvalues which are assumed to be distinct, and

$$|\lambda_1| > |\lambda_2| \ge \dots \ge |\lambda_n|$$

We take some nonzero vector $y_0$ and with the aid of matrix $A$ construct a sequence of iterations

$$y_k = A y_{k-1} \quad (k = 1, 2, \dots) \tag{1}$$

For the vector $y_0$ we also form, using the transpose $A'$, a second sequence of iterations

$$y'_k = A' y'_{k-1} \quad (k = 1, 2, \dots) \tag{2}$$

where $y'_0 = y_0$.

By Theorem 1 of Sec. 10.16, in the space $E_n$ we choose two proper bases $\{x_j\}$ and $\{x'_j\}$ for the matrices $A$ and $A'$, respectively, satisfying the conditions of biorthonormalization:

$$(x_i, x'_j) = \delta_{ij} \tag{3}$$

where $A x_i = \lambda_i x_i$ and $A' x'_j = \lambda_j x'_j$ $(i, j = 1, 2, \dots, n)$. Denote the coordinates of the vector $y_0$ in the basis $\{x_j\}$ by $a_1, \dots, a_n$, and in the basis $\{x'_j\}$ by $b_1, \dots, b_n$, that is,

$$y_0 = a_1 x_1 + \dots + a_n x_n \quad \text{and} \quad y_0 = b_1 x'_1 + \dots + b_n x'_n$$

whence

$$y_k = A^k y_0 = \sum_{i=1}^{n} a_i \lambda_i^k x_i \tag{4}$$

and

$$y'_k = A'^k y_0 = \sum_{j=1}^{n} b_j \lambda_j^k x'_j \quad (k = 1, 2, \dots) \tag{4'}$$

Form the scalar product

$$(y_k, y'_k) = (A^k y_0, A'^k y_0) = \left( \sum_{i=1}^{n} a_i \lambda_i^k x_i, \; \sum_{j=1}^{n} b_j \lambda_j^k x'_j \right)$$

From this, by virtue of the orthonormalization condition, we find

$$(y_k, y'_k) = \sum_{i=1}^{n} a_i b_i \lambda_i^{2k} = a_1 b_1 \lambda_1^{2k} + a_2 b_2 \lambda_2^{2k} + \dots + a_n b_n \lambda_n^{2k} \tag{5}$$

Similarly

$$(y_{k-1}, y'_k) = a_1 b_1 \lambda_1^{2k-1} + a_2 b_2 \lambda_2^{2k-1} + \dots + a_n b_n \lambda_n^{2k-1} \tag{6}$$

Hence, for $a_1 b_1 \ne 0$ we have

$$\frac{(y_k, y'_k)}{(y_{k-1}, y'_k)} = \frac{a_1 b_1 \lambda_1^{2k} + a_2 b_2 \lambda_2^{2k} + \dots + a_n b_n \lambda_n^{2k}}{a_1 b_1 \lambda_1^{2k-1} + a_2 b_2 \lambda_2^{2k-1} + \dots + a_n b_n \lambda_n^{2k-1}} = \lambda_1 + O\left(\left(\frac{\lambda_2}{\lambda_1}\right)^{2k}\right)$$

Thus

$$\lambda_1 \approx \frac{(y_k, y'_k)}{(y_{k-1}, y'_k)} = \frac{(A^k y_0, A'^k y_0)}{(A^{k-1} y_0, A'^k y_0)} \tag{7}$$

This method is especially convenient for a symmetric matrix $A$, since then $A' = A$, and we simply have

$$\lambda_1 \approx \frac{(A^k y_0, A^k y_0)}{(A^{k-1} y_0, A^k y_0)} \tag{8}$$

and so we only have to construct one sequence $y_k = A^k y_0$ $(k = 1, 2, \dots)$.
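
Formula (7) translates into a short loop. The following Python sketch assumes NumPy; the name `scalar_product_method` is hypothetical.

```python
import numpy as np

def scalar_product_method(A, y0, k=10):
    """Approximate lambda_1 by formula (7): (y_k, y'_k) / (y_(k-1), y'_k)."""
    A = np.asarray(A, dtype=float)
    y = np.asarray(y0, dtype=float)   # y_k, iterated with A
    y_prev = y
    z = y.copy()                      # y'_k, iterated with the transpose A'
    for _ in range(k):
        y_prev = y
        y = A @ y
        z = A.T @ z
    return (y @ z) / (y_prev @ z)
```

For a symmetric matrix the two sequences coincide, so the second matrix-vector product in the loop could be dropped, in line with formula (8).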


Example. Use the method of scalar products to find the largest eigenvalue of the matrix (see Sec. 12.11)

$$A = \begin{bmatrix} 4 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 1 \end{bmatrix}$$

Solution. Since the matrix $A$ is symmetric, it suffices to construct only one sequence of iterations $A^k y_0$ $(k = 1, 2, \dots)$. Taking

$$y_0 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$$

for the initial vector we can use the results of Table 27. For instance, when $k = 5$ and $k = 6$ we have

$$A^5 y_0 = \begin{bmatrix} 2268 \\ 1089 \\ 333 \end{bmatrix} \quad \text{and} \quad A^6 y_0 = \begin{bmatrix} 10161 \\ 4779 \\ 1422 \end{bmatrix}$$

whence

$$(A^5 y_0, A^6 y_0) = 2268 \cdot 10{,}161 + 1089 \cdot 4779 + 333 \cdot 1422 = 28{,}723{,}005$$

and

$$(A^6 y_0, A^6 y_0) = 10{,}161^2 + 4779^2 + 1422^2 = 128{,}106{,}846$$

And so

$$\lambda_1 \approx \frac{(A^6 y_0, A^6 y_0)}{(A^5 y_0, A^6 y_0)} = \frac{128{,}106{,}846}{28{,}723{,}005} = 4.46$$

which coincides (within the digits written) with the value found earlier with the aid of $A^{10} y$ (see Sec. 12.11).

Note. The method for finding the numerically largest root of a characteristic equation (Sec. 12.11) may be used to find the numerically largest root of an algebraic equation:

$$x^n + p_1 x^{n-1} + \dots + p_n = 0 \tag{9}$$

Indeed, equation (9), as may be readily verified directly, is the secular equation of the matrix (cf. Sec. 12.3, Frobenius matrix)

$$P = \begin{bmatrix} -p_1 & -p_2 & \dots & -p_{n-1} & -p_n \\ 1 & 0 & \dots & 0 & 0 \\ \vdots & \vdots & & \vdots & \vdots \\ 0 & 0 & \dots & 1 & 0 \end{bmatrix}$$

which means that equation (9) is equivalent to the equation

$$\det(xE - P) = 0$$

If (9) does not have zero roots, then, in analogous fashion, we can determine the smallest (in modulus) root of the equation, namely for $p_n \ne 0$, assuming $x = \dfrac{1}{y}$, we obtain

$$y^n + \frac{p_{n-1}}{p_n} y^{n-1} + \dots + \frac{1}{p_n} = 0 \tag{10}$$

The reciprocal of the numerically largest root of (10) will obviously give us the numerically smallest root of equation (9).
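
The Note can be illustrated by applying the iteration of Sec. 12.11 to the Frobenius matrix. The sketch below assumes NumPy; the names `frobenius_matrix`, `largest_root` and `smallest_root` are hypothetical, and the choice of the first unit vector as the starting vector and of the largest coordinate for the ratio are assumptions made here.

```python
import numpy as np

def frobenius_matrix(p):
    """Matrix P whose secular equation is x^n + p_1 x^(n-1) + ... + p_n = 0."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    P = np.zeros((n, n))
    P[0, :] = -p                    # first row: -p_1, ..., -p_n
    P[1:, :-1] = np.eye(n - 1)      # ones below the principal diagonal
    return P

def largest_root(p, iterations=200):
    """Numerically largest root of (9), assuming it is real and dominant."""
    P = frobenius_matrix(p)
    y = np.zeros(len(p))
    y[0] = 1.0                      # starting vector (assumption)
    lam = 0.0
    for _ in range(iterations):
        y_next = P @ y
        i = np.argmax(np.abs(y))    # use the largest coordinate for the ratio
        lam = y_next[i] / y[i]
        y = y_next / np.linalg.norm(y_next)
    return lam

def smallest_root(p, iterations=200):
    """Numerically smallest root of (9) via equation (10), for p_n != 0."""
    p = np.asarray(p, dtype=float)
    q = np.append(p[-2::-1], 1.0) / p[-1]   # p_(n-1)/p_n, ..., p_1/p_n, 1/p_n
    return 1.0 / largest_root(q, iterations)
```

For example, for $x^3 - 6x^2 + 11x - 6 = 0$ one would call `largest_root([-6, 11, -6])` and `smallest_root([-6, 11, -6])`.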


12.13 FINDING THE SECOND EIGENVALUE OF A MATRIX 
AND THE SECOND EIGENVECTOR 

Suppose the eigenvalues $\lambda_j$ $(j = 1, 2, \dots, n)$ of matrix $A$ are such that

$$|\lambda_1| > |\lambda_2| > |\lambda_3| \ge \dots \ge |\lambda_n| \tag{1}$$

that is, there are two distinct numerically largest eigenvalues $\lambda_1$ and $\lambda_2$ of matrix $A$. In this case, we can use a device similar to the one discussed above (Sec. 12.11) to approximate the second eigenvalue $\lambda_2$ and the eigenvector $x^{(2)}$ corresponding to it.

From formula (2) of Sec. 12.11 we have

$$A^m y = c_1 \lambda_1^m x^{(1)} + c_2 \lambda_2^m x^{(2)} + \dots + c_n \lambda_n^m x^{(n)} \tag{2}$$

and

$$A^{m+1} y = c_1 \lambda_1^{m+1} x^{(1)} + c_2 \lambda_2^{m+1} x^{(2)} + \dots + c_n \lambda_n^{m+1} x^{(n)} \tag{3}$$

Eliminate from (2) and (3) the terms containing $\lambda_1$. To do this, subtract from (3) the equation (2) multiplied by $\lambda_1$ to get

$$A^{m+1} y - \lambda_1 A^m y = c_2 \lambda_2^m (\lambda_2 - \lambda_1) x^{(2)} + \dots + c_n \lambda_n^m (\lambda_n - \lambda_1) x^{(n)} \tag{4}$$

For the sake of brevity, we introduce the notation

$$\Delta_\lambda A^m y = A^{m+1} y - \lambda A^m y \tag{5}$$

We will call expression (5) the $\lambda$-difference of $A^m y$. If $c_2 \ne 0$, then, clearly, the first term in the right member of (4) is its principal term as $m \to \infty$ and we have the approximate equation

$$\Delta_{\lambda_1} A^m y \approx c_2 \lambda_2^m (\lambda_2 - \lambda_1) x^{(2)} \tag{6}$$

whence

$$\Delta_{\lambda_1} A^{m-1} y \approx c_2 \lambda_2^{m-1} (\lambda_2 - \lambda_1) x^{(2)} \tag{7}$$

Let

$$A^m y = y^{(m)} = \begin{bmatrix} y_1^{(m)} \\ \vdots \\ y_n^{(m)} \end{bmatrix}$$

From formulas (6) and (7) we derive

$$\lambda_2 \approx \frac{\Delta_{\lambda_1} y_i^{(m)}}{\Delta_{\lambda_1} y_i^{(m-1)}} = \frac{y_i^{(m+1)} - \lambda_1 y_i^{(m)}}{y_i^{(m)} - \lambda_1 y_i^{(m-1)}} \quad (i = 1, 2, \dots, n) \tag{8}$$

Using formula (8) we can approximately compute the second eigenvalue $\lambda_2$. Note that in practical situations it is sometimes better (because of loss of accuracy when subtracting nearly equal numbers) to take a smaller iteration number $k$ for determining $\lambda_2$ than the iteration number $m$ for determining $\lambda_1$; in other words, it is advisable to set

$$\lambda_2 \approx \frac{y_i^{(k+1)} - \lambda_1 y_i^{(k)}}{y_i^{(k)} - \lambda_1 y_i^{(k-1)}} \tag{9}$$

where $k$ is the smallest of the numbers for which $\lambda_2$ begins to dominate the subsequent eigenvalues. Generally speaking, formula (9) yields rough values of $\lambda_2$. It will be seen that if the moduli of all the eigenvalues are distinct, then by means of formulas similar to (9) one can also compute the remaining eigenvalues of a given matrix. However, the results of the computations will be still less reliable.

As for the eigenvector $x^{(2)}$, we can, as follows from (6), put

$$x^{(2)} \approx \Delta_{\lambda_1} y^{(k)} \tag{10}$$


There is also an extension of this method to the case of mul- 
tiple roots of a characteristic equation [1]. 
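
Formulas (9) and (10) can be sketched as follows. The sketch assumes NumPy and assumes that an approximation `lam1` of the first eigenvalue is already available; the name `second_eigenvalue` is hypothetical.

```python
import numpy as np

def second_eigenvalue(A, y, lam1, k=8):
    """Approximate lambda_2 and x^(2) from lambda-differences, formulas (9), (10)."""
    A = np.asarray(A, dtype=float)
    ys = [np.asarray(y, dtype=float)]
    for _ in range(k + 1):
        ys.append(A @ ys[-1])               # ys[m] = A^m y
    d_k = ys[k + 1] - lam1 * ys[k]          # lambda_1-difference of y^(k)
    d_km1 = ys[k] - lam1 * ys[k - 1]        # lambda_1-difference of y^(k-1)
    lam2 = (d_k / d_km1).mean()             # mean of the coordinate ratios
    return lam2, d_k / np.linalg.norm(d_k)
```

The quality of the result depends strongly on the accuracy of `lam1`, since both differences are formed by subtracting nearly equal numbers.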


Example. Determine the subsequent eigenvalues and eigenvectors of the matrix (see Sec. 12.11, Example)

$$A = \begin{bmatrix} 4 & 1 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 1 \end{bmatrix}$$

Solution. To find the second eigenvalue we take $k = 8$. We have (see Table 27)

   A^7 y     A^8 y     A^9 y
   45433    202833    905238
   21141     93906    417987
    6201     27342    121248

Form the $\lambda$-differences using the formula

$$\Delta_{\lambda_1} y_i^{(k)} = y_i^{(k+1)} - \lambda_1 y_i^{(k)} \quad (i = 1, 2, 3)$$

where $y_i^{(k)}$ is the $i$th coordinate of $A^k y$. For each coordinate, $\lambda_1$ assumes its own value: $\lambda_1 = 4.462$, $\lambda_1 = 4.456$, $\lambda_1 = 4.447$ (Table 28).


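TABLE 28
COMPUTING THE SECOND EIGENVALUE

  i    y_i^(8)   λ_1 y_i^(7)   Δ y_i^(7)    y_i^(9)   λ_1 y_i^(8)   Δ y_i^(8)
  1    202833      202722          111      905238      905041          197
  2     93906       94204         -298      417987      418445         -458
  3     27342       27576         -234      121248      121590         -342

From this we get

$$\frac{\Delta_{\lambda_1} y_1^{(8)}}{\Delta_{\lambda_1} y_1^{(7)}} = \frac{197}{111} = 1.78, \quad \frac{\Delta_{\lambda_1} y_2^{(8)}}{\Delta_{\lambda_1} y_2^{(7)}} = \frac{-458}{-298} = 1.54, \quad \frac{\Delta_{\lambda_1} y_3^{(8)}}{\Delta_{\lambda_1} y_3^{(7)}} = \frac{-342}{-234} = 1.46$$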

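And so, we have approximately,

$$\lambda_2 = \frac{1}{3}(1.78 + 1.54 + 1.46) \approx 1.59$$

For the second eigenvector we can take

$$\Delta_{\lambda_1} A^8 y = \begin{bmatrix} 197 \\ -458 \\ -342 \end{bmatrix}$$

Normalizing this vector, we obtain

$$x^{(2)} = \begin{bmatrix} 0.33 \\ -0.76 \\ -0.56 \end{bmatrix}$$

Since matrix $A$ is symmetric, the vectors $x^{(1)}$ (Sec. 12.11) and $x^{(2)}$ must be mutually orthogonal. Verification yields

$$(x^{(1)}, x^{(2)}) = 0.90 \cdot 0.33 + 0.42 \cdot (-0.76) + 0.12 \cdot (-0.56) = -0.09$$

whence the angle $(x^{(1)}, x^{(2)}) \approx 95°$, which is rather inaccurate.

The third eigenvalue $\lambda_3$ is found from the trace of $A$:

$$\lambda_1 + \lambda_2 + \lambda_3 = \operatorname{tr} A = 4 + 2 + 1 = 7$$

whence

$$\lambda_3 = 7 - 4.46 - 1.59 \approx 0.95$$

The eigenvector

$$x^{(3)} = \begin{bmatrix} x_1^{(3)} \\ x_2^{(3)} \\ x_3^{(3)} \end{bmatrix}$$

may be computed from the orthogonality conditions:

$$\begin{aligned} 0.90 x_1^{(3)} + 0.42 x_2^{(3)} + 0.12 x_3^{(3)} &= 0, \\ 0.33 x_1^{(3)} + (-0.76) x_2^{(3)} + (-0.56) x_3^{(3)} &= 0 \end{aligned}$$

whence

$$\frac{x_1^{(3)}}{\begin{vmatrix} 0.42 & 0.12 \\ -0.76 & -0.56 \end{vmatrix}} = \frac{x_2^{(3)}}{\begin{vmatrix} 0.12 & 0.90 \\ -0.56 & 0.33 \end{vmatrix}} = \frac{x_3^{(3)}}{\begin{vmatrix} 0.90 & 0.42 \\ 0.33 & -0.76 \end{vmatrix}}$$

or

$$\frac{x_1^{(3)}}{-0.144} = \frac{x_2^{(3)}}{0.539} = \frac{x_3^{(3)}}{-0.818}$$

Normalizing, we finally get

$$x^{(3)} = \begin{bmatrix} -0.14 \\ 0.53 \\ -0.81 \end{bmatrix}$$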


12.14 THE METHOD OF EXHAUSTION 


There is yet another method, called the method of exhaustion [1], 
for determining the second eigenvalue of a matrix and the eigen- 
vector belonging to it. 

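Suppose matrix $A = [a_{ij}]$ is real, has distinct eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$ and

$$|\lambda_1| > |\lambda_2| > |\lambda_3| > \dots > |\lambda_n|$$

Besides $A$ we consider the matrix

$$A_1 = A - \lambda_1 X_1 X'_1 \tag{1}$$

where $\lambda_1$ is the first eigenvalue of $A$,

$$X_1 = \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix}$$

is the corresponding eigenvector of $A$ regarded as a column matrix, and

$$X'_1 = [x'_{11} \; x'_{21} \; \dots \; x'_{n1}]$$

is the eigenvector, corresponding to $\lambda_1$, of the transpose $A'$ which is regarded as a row matrix; the vectors $X_1$ and $X'_1$ are normalized so that their scalar product is equal to unity:

$$(X_1, X'_1) = X'_1 X_1 = \sum_{j=1}^{n} x_{j1} x'_{j1} = 1 \tag{2}$$

We assume $\lambda_1$ and $X_1$ and $X'_1$ to be known.

In expanded form, the matrix $A_1$ is written as

$$A_1 = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \dots & a_{nn} \end{bmatrix} - \lambda_1 \begin{bmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{n1} \end{bmatrix} [x'_{11} \; x'_{21} \; \dots \; x'_{n1}]$$

$$= \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \dots & a_{nn} \end{bmatrix} - \lambda_1 \begin{bmatrix} x_{11} x'_{11} & x_{11} x'_{21} & \dots & x_{11} x'_{n1} \\ x_{21} x'_{11} & x_{21} x'_{21} & \dots & x_{21} x'_{n1} \\ \vdots & \vdots & & \vdots \\ x_{n1} x'_{11} & x_{n1} x'_{21} & \dots & x_{n1} x'_{n1} \end{bmatrix} \tag{1'}$$

We will prove that the eigenvectors $X_j$ $(j = 1, 2, \dots, n)$ of matrix $A$ are also the eigenvectors of $A_1$ and the corresponding eigenvalues are preserved, with the exception of $\lambda_1$ in place of which a zero eigenvalue appears.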

Indeed, using the associativity of matrix multiplication and the normalization condition (2), we have

$$A_1 X_1 = A X_1 - \lambda_1 (X_1 X'_1) X_1 = A X_1 - \lambda_1 X_1 (X'_1 X_1) = \lambda_1 X_1 - \lambda_1 X_1 = 0$$

or

$$A_1 X_1 = 0 \cdot X_1$$

and, hence, zero is an eigenvalue of the matrix $A_1$.

Then, for $j > 1$, taking into account that

$$(X_j, X'_1) = X'_1 X_j = 0 \quad (j = 2, \dots, n)$$

(see Sec. 10.16, Theorem 1), we obtain

$$A_1 X_j = A X_j - \lambda_1 X_1 (X'_1 X_j) = A X_j = \lambda_j X_j \quad (j = 2, \dots, n)$$

Thus, $\lambda_2$ is the numerically largest eigenvalue of $A_1$, and so we can make use of the earlier indicated methods (Secs. 12.11 and 12.12) to determine $\lambda_2$ and the associated eigenvector $X_2$. This technique is called the method of exhaustion. For example, proceeding from an arbitrary vector $y_0$, we can compute $\lambda_2$ from the formula

$$\lambda_2 \approx \frac{(A_1^{m+1} y_0)_i}{(A_1^m y_0)_i} \quad (i = 1, 2, \dots, n)$$

Also

$$X_2 \approx c A_1^m y_0 \quad (c \ne 0)$$


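We will now show that to find the iterations $A_1^m y_0$ $(m = 1, 2, \dots)$ one can use the formula

$$A_1^m y_0 = A^m y_0 - \lambda_1^m X_1 X'_1 y_0 \tag{3}$$

which dispenses with direct iteration of the matrix $A_1$.

Indeed, let the eigenvectors $X_i$ and $X'_i$ $(i = 1, 2, \dots, n)$ of matrix $A$ and its transpose $A'$ satisfy the conditions of biorthonormalization (Sec. 10.16, Theorem 2)

$$X'_i X_j = \delta_{ij}$$

where $\delta_{ij}$ is the Kronecker delta. We then have the bilinear expansion of $A$

$$A = \lambda_1 X_1 X'_1 + \lambda_2 X_2 X'_2 + \dots + \lambda_n X_n X'_n \tag{4}$$

whence

$$A_1 = A - \lambda_1 X_1 X'_1 = \lambda_2 X_2 X'_2 + \dots + \lambda_n X_n X'_n \tag{5}$$

Since

$$A^{m-1} X_i = \lambda_i^{m-1} X_i \quad (i = 1, 2, \dots, n)$$

then, by premultiplying (4) by $A^{m-1}$, we get

$$A^m = \lambda_1 A^{m-1} X_1 X'_1 + \lambda_2 A^{m-1} X_2 X'_2 + \dots + \lambda_n A^{m-1} X_n X'_n = \lambda_1^m X_1 X'_1 + \lambda_2^m X_2 X'_2 + \dots + \lambda_n^m X_n X'_n \tag{6}$$

Similarly, taking into consideration that

$$A_1^{m-1} X_1 = A_1^{m-2} (A_1 X_1) = 0$$

and

$$A_1^{m-1} X_i = \lambda_i^{m-1} X_i \quad (i = 2, 3, \dots, n)$$

we obtain

$$A_1^m = \lambda_2 A_1^{m-1} X_2 X'_2 + \dots + \lambda_n A_1^{m-1} X_n X'_n = \lambda_2^m X_2 X'_2 + \dots + \lambda_n^m X_n X'_n \tag{7}$$

by premultiplying (5) by $A_1^{m-1}$. From formulas (6) and (7) follows

$$A_1^m = A^m - \lambda_1^m X_1 X'_1$$

which is equivalent to relation (3).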
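
A minimal Python sketch of the method of exhaustion is given below. It assumes NumPy and assumes that $\lambda_1$, $X_1$ and $X'_1$ are already known; the name `exhaustion_second_eigenvalue`, the rescaling of the iterate and the use of the largest coordinate for the ratio are choices made here for illustration.

```python
import numpy as np

def exhaustion_second_eigenvalue(A, lam1, X1, X1p, y0, m=40):
    """Iterate with A_1 = A - lam1 * X1 X1' to approximate lambda_2 and X_2."""
    A = np.asarray(A, dtype=float)
    X1 = np.asarray(X1, dtype=float)
    X1p = np.asarray(X1p, dtype=float)
    X1p = X1p / (X1p @ X1)                   # normalization (2): X1' X1 = 1
    A1 = A - lam1 * np.outer(X1, X1p)        # formula (1)
    y = np.asarray(y0, dtype=float)
    lam2 = 0.0
    for _ in range(m):
        y_next = A1 @ y
        i = np.argmax(np.abs(y))             # ratio of one coordinate
        lam2 = y_next[i] / y[i]
        y = y_next / np.linalg.norm(y_next)  # rescale to keep numbers bounded
    return lam2, y, A1
```

For a symmetric matrix one can pass the same vector for `X1` and `X1p`. Relation (3) offers an alternative that avoids forming `A1` explicitly.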
12.15 FINDING THE EIGENVALUES AND EIGENVECTORS OF A 
POSITIVE DEFINITE SYMMETRIC MATRIX 


Here we give an iteration method for finding simultaneously the 
eigenvalues and eigenvectors of a positive definite matrix [5]. 
As we know, (Sec. 10.15), if a real matrix 


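$$A = [a_{ij}]$$

is symmetric and positive definite, then

(1) the roots $\lambda_1, \lambda_2, \dots, \lambda_n$ of its characteristic equation

$$\begin{vmatrix} a_{11} - \lambda & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} - \lambda & \dots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \dots & a_{nn} - \lambda \end{vmatrix} = 0 \tag{1}$$

are real and positive;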


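(2) the eigenvectors

$$x^{(j)} = \begin{bmatrix} x_1^{(j)} \\ \vdots \\ x_n^{(j)} \end{bmatrix} \quad (j = 1, 2, \dots, n)$$

can be taken real and satisfy the orthogonality conditions

$$\sum_{i=1}^{n} x_i^{(j)} x_i^{(k)} = 0 \quad \text{for } j \ne k \tag{2}$$

Let us write the system for determining the eigenvector $x^{(1)}$:

$$\begin{aligned} (a_{11} - \lambda_1) x_1^{(1)} + a_{12} x_2^{(1)} + \dots + a_{1n} x_n^{(1)} &= 0, \\ a_{21} x_1^{(1)} + (a_{22} - \lambda_1) x_2^{(1)} + \dots + a_{2n} x_n^{(1)} &= 0, \\ &\;\;\vdots \\ a_{n1} x_1^{(1)} + a_{n2} x_2^{(1)} + \dots + (a_{nn} - \lambda_1) x_n^{(1)} &= 0 \end{aligned}$$

or

$$\begin{aligned} x_1^{(1)} &= \frac{1}{\lambda_1} \left( a_{11} x_1^{(1)} + a_{12} x_2^{(1)} + \dots + a_{1n} x_n^{(1)} \right), \\ x_2^{(1)} &= \frac{1}{\lambda_1} \left( a_{21} x_1^{(1)} + a_{22} x_2^{(1)} + \dots + a_{2n} x_n^{(1)} \right), \\ &\;\;\vdots \\ x_{n-1}^{(1)} &= \frac{1}{\lambda_1} \left( a_{n-1,1} x_1^{(1)} + a_{n-1,2} x_2^{(1)} + \dots + a_{n-1,n} x_n^{(1)} \right), \\ \lambda_1 &= \frac{1}{x_n^{(1)}} \left( a_{n1} x_1^{(1)} + a_{n2} x_2^{(1)} + \dots + a_{nn} x_n^{(1)} \right) \end{aligned} \tag{3}$$

Since the coordinates of the eigenvectors are determined to within the proportionality factor, one of them is arbitrary; for example, except for a special case, we can put $x_n^{(1)} = 1$. Generally speaking, system (3) can be solved by the method of iteration [5] by choosing suitable initial values $x_i^{(1,0)}$, $\lambda_1^{(0)}$ and assuming

$$\begin{aligned} x_i^{(1,k+1)} &= \frac{1}{\lambda_1^{(k)}} \left( \sum_{j=1}^{n-1} a_{ij} x_j^{(1,k)} + a_{in} \right) \quad (i = 1, 2, \dots, n-1), \\ \lambda_1^{(k+1)} &= \sum_{j=1}^{n-1} a_{nj} x_j^{(1,k+1)} + a_{nn} \quad (k = 0, 1, 2, \dots) \end{aligned}$$

It is also possible to take advantage of the Seidel process. Thus, we find the first root of the characteristic equation (1)

$$\lambda_1 \approx \lambda_1^{(k)} \tag{4}$$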


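and the first eigenvector

$$x^{(1)} \approx \begin{bmatrix} x_1^{(1,k)} \\ x_2^{(1,k)} \\ \vdots \\ x_{n-1}^{(1,k)} \\ 1 \end{bmatrix}$$

To determine the second root $\lambda_2$ of equation (1) and the second eigenvector $x^{(2)}$, write the appropriate system of equations:

$$\lambda_2 x_i^{(2)} = \sum_{j=1}^{n} a_{ij} x_j^{(2)} \quad (i = 1, 2, \dots, n) \tag{5}$$

We eliminate from the orthogonality relation

$$\sum_{i=1}^{n} x_i^{(1)} x_i^{(2)} = 0 \tag{6}$$

one of the unknowns $x_i^{(2)}$, say, $x_n^{(2)}$. Then system (5) is replaced by the equivalent system

$$\begin{aligned} x_i^{(2)} &= \frac{1}{\lambda_2} \sum_{j=1}^{n-1} a_{ij}^{(1)} x_j^{(2)} \quad (i = 1, 2, \dots, n-2), \\ \lambda_2 &= \frac{1}{x_{n-1}^{(2)}} \sum_{j=1}^{n-1} a_{n-1,j}^{(1)} x_j^{(2)} \end{aligned} \tag{7}$$

where $a_{ij}^{(1)}$ denote the coefficients obtained after the elimination. Setting $x_{n-1}^{(2)} = 1$, we solve (7) by the method of iteration, and thus find the second root $\lambda_2$ of the characteristic equation (1) and the eigenvector $x^{(2)}$; the $n$th coordinate of this vector is determined from the orthogonality condition (6). The remaining roots $\lambda_j$ $(j = 3, \dots, n)$ of (1) and the corresponding eigenvectors $x^{(j)}$ are found in similar fashion.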

We do not consider any exceptional cases that may arise in 
the use of this method. 
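
The iteration for the first root and the first eigenvector can be sketched as follows. The sketch assumes NumPy; the name `first_eigenpair_by_iteration` and the choice of all-ones initial values are assumptions made here.

```python
import numpy as np

def first_eigenpair_by_iteration(A, iterations=100):
    """Solve system (3) by simple iteration with x_n fixed at 1."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    x = np.ones(n)              # initial values x_i^(1,0) = 1, x_n = 1 (assumption)
    lam = A[-1] @ x             # lambda^(0) from the last equation of (3)
    for _ in range(iterations):
        x_new = (A @ x) / lam   # first n-1 equations of (3)
        x_new[-1] = 1.0         # the nth coordinate stays equal to 1
        x = x_new
        lam = A[-1] @ x         # last equation of (3) with the new coordinates
    return lam, x
```

With the starting values above, the first steps reproduce the pattern of successive approximations that system (3) prescribes: the root is recomputed from the last equation after every update of the coordinates.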


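Example. For the following matrix find the roots $\lambda_j$ of the characteristic equation and the eigenvectors $x^{(j)}$ [5]:

$$A = \begin{bmatrix} 4 & 2 & 2 \\ 2 & 5 & 1 \\ 2 & 1 & 6 \end{bmatrix}$$

Solution. The matrix $A$ is symmetric and positive definite since

$$\Delta_1 = 4 > 0, \quad \Delta_2 = \begin{vmatrix} 4 & 2 \\ 2 & 5 \end{vmatrix} = 16 > 0, \quad \Delta_3 = \det A = 80 > 0$$

The associated system is of the form

$$\begin{aligned} \lambda_j x_1^{(j)} &= 4 x_1^{(j)} + 2 x_2^{(j)} + 2 x_3^{(j)}, \\ \lambda_j x_2^{(j)} &= 2 x_1^{(j)} + 5 x_2^{(j)} + x_3^{(j)}, \\ \lambda_j x_3^{(j)} &= 2 x_1^{(j)} + x_2^{(j)} + 6 x_3^{(j)} \end{aligned} \quad (j = 1, 2, 3) \tag{8}$$

Setting $j = 1$ and $x_3^{(1)} = 1$, we get

$$\begin{aligned} x_1^{(1)} &= \frac{1}{\lambda_1} \left( 4 x_1^{(1)} + 2 x_2^{(1)} + 2 \right), \\ x_2^{(1)} &= \frac{1}{\lambda_1} \left( 2 x_1^{(1)} + 5 x_2^{(1)} + 1 \right), \\ \lambda_1 &= 2 x_1^{(1)} + x_2^{(1)} + 6 \end{aligned} \tag{9}$$

We solve (9) by the method of iteration, choosing the initial values

$$x_1^{(1,0)} = 1 \quad \text{and} \quad x_2^{(1,0)} = 1$$

Then we get $\lambda_1^{(0)} = 9$ from the last equation of system (9). The computations are arranged in Table 29.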


TABLE 29
USING THE ITERATION METHOD TO COMPUTE THE EIGENVALUES
AND EIGENVECTORS OF A MATRIX WHICH CORRESPOND
TO THE FIRST ROOT OF THE CHARACTERISTIC EQUATION

   k    x_1^(1,k)   x_2^(1,k)   x_3^(1,k)   λ_1^(k)
   0    1           1           1           9
   1    0.89        0.89        1           8.67
   2    0.85        0.83        1           8.53
   3    0.83        0.80        1           8.46
   4    0.81        0.78        1           8.40
   5    0.805       0.770       1           8.38
   6    0.806       0.771       1           8.383
   7    0.807       0.771       1           8.385
   8    0.8074      0.7715      1           8.3863
   9    0.8076      0.7717      1           8.3869
  10    0.8076      0.7719      1           8.3871
  11    0.8077      0.7720      1           8.3874


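We can take

$$\lambda_1 = 8.3874$$

and

$$x^{(1)} = \begin{bmatrix} 0.8077 \\ 0.7720 \\ 1 \end{bmatrix}$$

Now put $j = 2$ in system (8). From the orthogonality condition for the vectors $x^{(1)}$ and $x^{(2)}$ we have

$$0.8077 x_1^{(2)} + 0.7720 x_2^{(2)} + x_3^{(2)} = 0$$

whence

$$x_3^{(2)} = -0.8077 x_1^{(2)} - 0.7720 x_2^{(2)} \tag{10}$$

Substituting this expression into system (8) and putting $x_2^{(2)} = 1$, we obtain

$$\begin{aligned} x_1^{(2)} &= \frac{1}{\lambda_2} \left( 2.3846 x_1^{(2)} + 0.4560 \right), \\ \lambda_2 &= 1.1923 x_1^{(2)} + 4.2280 \end{aligned} \tag{11}$$

We solve system (11) by the iteration method, setting

$$x_1^{(2,0)} = 1 \quad \text{and} \quad \lambda_2^{(0)} = 5.42$$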

The results of the computations are arranged in Table 30. 


TABLE 30
USING THE METHOD OF ITERATION TO COMPUTE THE EIGENVALUES
AND EIGENVECTORS OF A MATRIX WHICH CORRESPOND
TO THE SECOND ROOT OF THE CHARACTERISTIC EQUATION

   k    x_1^(2,k)   x_2^(2,k)   λ_2^(k)
   0    1           1           5.42
   1    0.52        1           4.85
   2    0.35        1           4.64
   3    0.28        1           4.56
   4    0.25        1           4.53
   5    0.23        1           4.500
   6    0.223       1           4.494
   7    0.220       1           4.490
   8    0.218       1           4.488
   9    0.2174      1           4.487
  10    0.2171      1           4.4868
  11    0.2170      1           4.4867











We can take 1,=4.4867 and x =0.2170, xP =1. 
The third coordihate is determined from the orthogonal rela- 
tions (10): 
x? = —0,9473 
and so 
0.2170 


29 9616 




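The third eigenvector $x^{(3)}$ is determined directly from the two orthogonality relations

$$\begin{aligned}
0.8077\,x_1^{(3)} + 0.7720\,x_2^{(3)} + x_3^{(3)} &= 0, \\
0.2170\,x_1^{(3)} + x_2^{(3)} - 0.9473\,x_3^{(3)} &= 0
\end{aligned}$$

Putting $x_1^{(3)} = 1$, we get $x_2^{(3)} = -0.5673$, $x_3^{(3)} = -0.3698$. Hence

$$x^{(3)} = \begin{bmatrix} 1 \\ -0.5673 \\ -0.3698 \end{bmatrix}$$

From the last equation of system (8) we also find, for $j = 3$,

$$\lambda_3 = 2.1260$$

For a check, form the trace of matrix $A$:

$$\operatorname{tr} A = \lambda_1 + \lambda_2 + \lambda_3 = 8.3874 + 4.4867 + 2.1260 = 15.0001 \approx 4 + 5 + 6$$

Note that as a rule the roots obtained by the iteration process are arranged in descending order of their moduli. The eigenvectors of the matrix are determined to within proportionality factors, and so all the solutions of (8) are:

$$x^{(1)} = c_1 \begin{bmatrix} 0.8077 \\ 0.7720 \\ 1 \end{bmatrix}, \qquad
x^{(2)} = c_2 \begin{bmatrix} 0.2170 \\ 1 \\ -0.9473 \end{bmatrix}, \qquad
x^{(3)} = c_3 \begin{bmatrix} 1 \\ -0.5673 \\ -0.3698 \end{bmatrix}$$

where $c_1$, $c_2$, $c_3$ are arbitrary constants different from zero.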
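The worked example can be retraced numerically. The sketch below is only an illustration under stated assumptions: the function name `eigenpair_by_iteration` and its parameters are hypothetical, and instead of eliminating one coordinate through relation (10) it projects the current iterate onto the orthogonal complement of the eigenvectors already found, which plays the same role. The first step reproduces the row k = 1 of Table 29.

```python
def mat_vec(a, x):
    """Product of a square matrix (list of rows) and a vector."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def eigenpair_by_iteration(a, x0, pivot, orth=(), tol=1e-12, max_iter=1000):
    """Iterate x <- A x / lambda, scaling so that x[pivot] == 1.

    Vectors in `orth` (eigenvectors found earlier) are projected out at
    every step; for a symmetric matrix this stands in for the
    orthogonality relation used in the text.
    """
    x = list(x0)
    lam = 0.0
    for _ in range(max_iter):
        for v in orth:
            c = dot(x, v) / dot(v, v)
            x = [xi - c * vi for xi, vi in zip(x, v)]
        y = mat_vec(a, x)
        new_lam = y[pivot] / x[pivot]      # current eigenvalue estimate
        x = [yi / y[pivot] for yi in y]    # normalize the pivot coordinate to 1
        if abs(new_lam - lam) < tol:
            return new_lam, x
        lam = new_lam
    return lam, x


A = [[4.0, 2.0, 2.0], [2.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
lam1, v1 = eigenpair_by_iteration(A, [1.0, 1.0, 1.0], pivot=2)
lam2, v2 = eigenpair_by_iteration(A, [1.0, 1.0, 0.0], pivot=1, orth=[v1])
lam3, v3 = eigenpair_by_iteration(A, [1.0, 0.0, 0.0], pivot=0, orth=[v1, v2])
print(lam1, lam2, lam3, lam1 + lam2 + lam3)
```

Because the loop runs to a much tighter tolerance than the twelve hand iterations of Tables 29 and 30, the printed values may differ from the tabulated ones in the fourth decimal place; the sum of the three eigenvalues should agree with the trace 15.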


12.16 USING THE COEFFICIENTS OF THE CHARACTERISTIC 
POLYNOMIAL OF A MATRIX FOR MATRIX INVERSION 


In Secs. 12.3 to 12.9 we gave techniques for polynomial expansion of the secular determinant of a matrix. It is comparatively simple to obtain the inverse matrix $A^{-1}$ with the aid of the coefficients of this characteristic polynomial and by forming the powers $A, A^2, \ldots, A^{n-1}$ of a nonsingular matrix $A$ of order $n$. In this respect, the Leverrier method (Sec. 12.8) is particularly advantageous.

Suppose we have a nonsingular matrix $A$ of order $n$. Consider its characteristic polynomial

$$\det(\lambda E - A) = \lambda^n + p_1\lambda^{n-1} + \ldots + p_{n-1}\lambda + p_n$$

According to the Cayley-Hamilton theorem (Sec. 11.2) we have

$$A^n + p_1 A^{n-1} + \ldots + p_{n-1}A + p_n E = 0 \tag{1}$$

Postmultiplying the matrix equation (1) by $A^{-1}$, we get

$$A^{n-1} + p_1 A^{n-2} + \ldots + p_{n-1}E + p_n A^{-1} = 0 \tag{2}$$

whence, for $p_n \neq 0$, we have

$$A^{-1} = -\frac{1}{p_n}\left(A^{n-1} + p_1 A^{n-2} + \ldots + p_{n-1}E\right) \tag{3}$$

Thus, if the coefficients of the characteristic polynomial of matrix $A$ are known and the powers of this matrix are formed up to the $(n-1)$th inclusive, then the inverse $A^{-1}$ can easily be computed by formula (3).

Note that if $p_n = 0$ and $p_{n-1} \neq 0$, then in order to obtain a formula containing $A^{-1}$ it is necessary to postmultiply the matrix equation (1) by $A^{-2}$, etc.


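Example. Find the inverse $A^{-1}$ of the matrix

$$A = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 2 & 3 \\ 3 & 2 & 1 & 2 \\ 4 & 3 & 2 & 1 \end{bmatrix}$$

(see Sec. 12.8, Example).

Solution. We take advantage of the earlier found powers of matrix $A$ (Sec. 12.8):

$$A^2 = \begin{bmatrix} 30 & 22 & 18 & 20 \\ 22 & 18 & 16 & 18 \\ 18 & 16 & 18 & 22 \\ 20 & 18 & 22 & 30 \end{bmatrix}$$

and

$$A^3 = \begin{bmatrix} 208 & 178 & 192 & 242 \\ 178 & 148 & 154 & 192 \\ 192 & 154 & 148 & 178 \\ 242 & 192 & 178 & 208 \end{bmatrix}$$

Since the characteristic polynomial of matrix $A$ is of the form

$$\det(\lambda E - A) = \lambda^4 - 4\lambda^3 - 40\lambda^2 - 56\lambda - 20$$

then from formula (3) we get

$$A^{-1} = \frac{1}{20}\left\{A^3 - 4A^2 - 40A - 56E\right\} =$$

$$= \frac{1}{10}\left\{
\begin{bmatrix} 104 & 89 & 96 & 121 \\ 89 & 74 & 77 & 96 \\ 96 & 77 & 74 & 89 \\ 121 & 96 & 89 & 104 \end{bmatrix}
- \begin{bmatrix} 60 & 44 & 36 & 40 \\ 44 & 36 & 32 & 36 \\ 36 & 32 & 36 & 44 \\ 40 & 36 & 44 & 60 \end{bmatrix}
- \begin{bmatrix} 20 & 40 & 60 & 80 \\ 40 & 20 & 40 & 60 \\ 60 & 40 & 20 & 40 \\ 80 & 60 & 40 & 20 \end{bmatrix}
- \begin{bmatrix} 28 & 0 & 0 & 0 \\ 0 & 28 & 0 & 0 \\ 0 & 0 & 28 & 0 \\ 0 & 0 & 0 & 28 \end{bmatrix}
\right\} =$$

$$= \begin{bmatrix} -0.4 & 0.5 & 0 & 0.1 \\ 0.5 & -1 & 0.5 & 0 \\ 0 & 0.5 & -1 & 0.5 \\ 0.1 & 0 & 0.5 & -0.4 \end{bmatrix}$$

As a check, we form the product

$$AA^{-1} = \begin{bmatrix} 1 & 2 & 3 & 4 \\ 2 & 1 & 2 & 3 \\ 3 & 2 & 1 & 2 \\ 4 & 3 & 2 & 1 \end{bmatrix}
\begin{bmatrix} -0.4 & 0.5 & 0 & 0.1 \\ 0.5 & -1 & 0.5 & 0 \\ 0 & 0.5 & -1 & 0.5 \\ 0.1 & 0 & 0.5 & -0.4 \end{bmatrix}
= \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$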
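Formula (3) lends itself to a short program. The following is a minimal sketch, not a definitive implementation: the helper names `char_poly_coeffs` and `inverse_from_char_poly` are hypothetical, the coefficients are obtained from the traces of the powers of the matrix by Newton's identities (the idea of the Leverrier method), and exact rational arithmetic from the standard library is assumed so that the check against the example is free of rounding.

```python
from fractions import Fraction


def mat_mul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def char_poly_coeffs(a):
    """Return ([p_1, ..., p_n], [E, A, ..., A^(n-1)]) for a square matrix a."""
    n = len(a)
    identity = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    powers = [identity]
    traces = []                       # traces[k-1] = s_k = tr(A^k)
    for _ in range(n):
        powers.append(mat_mul(powers[-1], a))
        traces.append(sum(powers[-1][i][i] for i in range(n)))
    p = []
    for k in range(1, n + 1):
        # Newton's identities: k p_k = -(s_k + p_1 s_{k-1} + ... + p_{k-1} s_1)
        acc = traces[k - 1] + sum(p[i] * traces[k - 2 - i] for i in range(k - 1))
        p.append(-acc / k)
    return p, powers[:n]


def inverse_from_char_poly(a):
    """A^{-1} = -(1/p_n) (A^{n-1} + p_1 A^{n-2} + ... + p_{n-1} E), formula (3)."""
    a = [[Fraction(v) for v in row] for row in a]
    n = len(a)
    p, powers = char_poly_coeffs(a)
    if p[-1] == 0:
        raise ValueError("p_n = 0: the matrix is singular")
    coeffs = [Fraction(1)] + p[:-1]   # multipliers of A^(n-1), ..., A, E
    acc = [[Fraction(0)] * n for _ in range(n)]
    for c, power in zip(coeffs, reversed(powers)):
        for i in range(n):
            for j in range(n):
                acc[i][j] += c * power[i][j]
    return [[-v / p[-1] for v in row] for row in acc]


A = [[1, 2, 3, 4], [2, 1, 2, 3], [3, 2, 1, 2], [4, 3, 2, 1]]
for row in inverse_from_char_poly(A):
    print([float(v) for v in row])
```

For matrices of large order the powers of $A$ grow quickly, so in floating-point arithmetic this route is mainly of interest when the powers and coefficients are already available, as in the example above.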




12.17 THE METHOD OF LYUSTERNIK FOR ACCELERATING 
THE CONVERGENCE OF THE ITERATION PROCESS 
IN THE SOLUTION OF A SYSTEM 
OF LINEAR EQUATIONS 


Suppose a system of linear equations

$$Ax = b \tag{1}$$

has been reduced to a form convenient for iteration,

$$x = \beta + \alpha x \tag{1'}$$

According to the method of iteration (Sec. 8.8), the successive approximations of the solution $x$ of system (1') are determined from the formula

$$x^{(m)} = \beta + \alpha x^{(m-1)} \qquad (m = 1, 2, \ldots) \tag{2}$$

where $x^{(0)}$ is an arbitrary initial vector.
We assume that the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ of the matrix $\alpha$ are distinct, and

$$|\lambda_1| > |\lambda_2| \geq \ldots \geq |\lambda_n| \tag{3}$$

The process of iteration (2) converges if

$$|\lambda_1| < 1$$

The first eigenvalue $\lambda_1$ may be approximated with the aid of methods indicated above (Secs. 12.11 and 12.12). L. A. Lyusternik [6] demonstrated that by using the eigenvalue $\lambda_1$ it is possible to substantially improve the convergence of the iteration process (2) for solving system (1'). We will now show how this is done.
For $m$ sufficiently large we can put, approximately,

$$x \approx x^{(m)}$$


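Let us estimate the error $x - x^{(m)}$. Provided the process (2) is convergent, we have

$$x = \lim_{m \to \infty} x^{(m)} = x^{(0)} + \sum_{k=1}^{\infty}\left(x^{(k)} - x^{(k-1)}\right)$$

and besides

$$x^{(m)} = x^{(0)} + \sum_{k=1}^{m}\left(x^{(k)} - x^{(k-1)}\right)$$

Therefore

$$x - x^{(m)} = \sum_{k=m+1}^{\infty}\left(x^{(k)} - x^{(k-1)}\right) = \left[x^{(m+1)} - x^{(m)}\right] + \left[x^{(m+2)} - x^{(m+1)}\right] + \ldots \tag{4}$$

Since

$$x^{(k+1)} - x^{(k)} = \left[\beta + \alpha x^{(k)}\right] - \left[\beta + \alpha x^{(k-1)}\right] = \alpha\left(x^{(k)} - x^{(k-1)}\right) = \alpha^k\left(x^{(1)} - x^{(0)}\right) \quad \text{for } k = 1, 2, \ldots$$

it follows that

$$x - x^{(m)} = \alpha^m\left(x^{(1)} - x^{(0)}\right) + \alpha^{m+1}\left(x^{(1)} - x^{(0)}\right) + \ldots \tag{5}$$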
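Let $y_1, y_2, \ldots, y_n$ be the eigenvectors of the matrix $\alpha$ that correspond to the eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ and form a basis in the space $E_n$. Expanding the vector $x^{(1)} - x^{(0)}$ in terms of the vectors of this basis, we get

$$x^{(1)} - x^{(0)} = c_1 y_1 + c_2 y_2 + \ldots + c_n y_n$$

where $c_j$ $(j = 1, 2, \ldots, n)$ are certain definite scalars. From this,

$$x^{(k)} - x^{(k-1)} = \alpha^{k-1}\left(x^{(1)} - x^{(0)}\right) = c_1\lambda_1^{k-1}y_1 + c_2\lambda_2^{k-1}y_2 + \ldots + c_n\lambda_n^{k-1}y_n \tag{6}$$

$$(k = m+1, m+2, \ldots)$$

Hence, we find, on the basis of (5),

$$x - x^{(m)} = c_1\left(\lambda_1^m + \lambda_1^{m+1} + \ldots\right)y_1 + c_2\left(\lambda_2^m + \lambda_2^{m+1} + \ldots\right)y_2 + \ldots + c_n\left(\lambda_n^m + \lambda_n^{m+1} + \ldots\right)y_n =$$

$$= \frac{c_1\lambda_1^m}{1 - \lambda_1}\,y_1 + \frac{c_2\lambda_2^m}{1 - \lambda_2}\,y_2 + \ldots + \frac{c_n\lambda_n^m}{1 - \lambda_n}\,y_n$$

Then, taking into account inequality (3), we obtain

$$x - x^{(m)} = \frac{c_1\lambda_1^m}{1 - \lambda_1}\,y_1 + O\left(\lambda_2^m\right) \tag{7}$$

Besides, from formula (6) we have, for $k = m + 1$,

$$x^{(m+1)} - x^{(m)} = c_1\lambda_1^m y_1 + O\left(\lambda_2^m\right) \tag{8}$$

and so

$$x - x^{(m)} = \frac{x^{(m+1)} - x^{(m)}}{1 - \lambda_1} + O\left(\lambda_2^m\right)$$

Thus, we finally have

$$x \approx x^{(m)} + \frac{x^{(m+1)} - x^{(m)}}{1 - \lambda_1} \tag{9}$$

The additional term $\dfrac{x^{(m+1)} - x^{(m)}}{1 - \lambda_1}$ perceptibly accelerates the convergence of the iteration process (2).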
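Since it follows from (8) that

$$x^{(m+1)} - x^{(m)} = \lambda_1\left(x^{(m)} - x^{(m-1)}\right) + O\left(\lambda_2^m\right) \tag{10}$$

formula (9) may be replaced by the following one:

$$x \approx x^{(m)} + \frac{\lambda_1}{1 - \lambda_1}\left(x^{(m)} - x^{(m-1)}\right) \tag{11}$$

Formula (11) makes it unnecessary to compute the next, in order, approximation.

On the basis of (10), the largest eigenvalue $\lambda_1$ may be determined from the formula

$$\lambda_1 \approx \frac{\left(x^{(m)} - x^{(m-1)}\right)_i}{\left(x^{(m-1)} - x^{(m-2)}\right)_i} \qquad (i = 1, 2, \ldots, n)$$

In the case of a symmetric matrix $\alpha$, we get a more exact formula by using the method of scalar products:

$$\lambda_1 \approx \frac{\left(x^{(m)} - x^{(m-1)},\; x^{(m)} - x^{(m-1)}\right)}{\left(x^{(m-1)} - x^{(m-2)},\; x^{(m)} - x^{(m-1)}\right)}$$

In particular, if

$$x^{(0)} = \beta$$

then

$$x^{(m)} - x^{(m-1)} = \alpha^{m-1}\left(x^{(1)} - x^{(0)}\right) = \alpha^m\beta$$

and

$$x^{(m-1)} - x^{(m-2)} = \alpha^{m-2}\left(x^{(1)} - x^{(0)}\right) = \alpha^{m-1}\beta$$

Therefore

$$\lambda_1 \approx \frac{\left(\alpha^m\beta\right)_i}{\left(\alpha^{m-1}\beta\right)_i} \tag{12}$$

where $(\alpha^m\beta)_i$ and $(\alpha^{m-1}\beta)_i$ are the $i$th coordinates of the vectors $\alpha^m\beta$ and $\alpha^{m-1}\beta$, respectively. Similarly, if matrix $\alpha$ is symmetric, then

$$\lambda_1 \approx \frac{\left(\alpha^m\beta,\; \alpha^m\beta\right)}{\left(\alpha^{m-1}\beta,\; \alpha^m\beta\right)} \tag{13}$$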

Example. Using the method of iteration, solve the system of equations [1]

$$\begin{aligned}
0.78x_1 - 0.02x_2 - 0.12x_3 - 0.14x_4 &= 0.76, \\
-0.02x_1 + 0.86x_2 - 0.04x_3 + 0.06x_4 &= 0.08, \\
-0.12x_1 - 0.04x_2 + 0.72x_3 - 0.08x_4 &= 1.12, \\
-0.14x_1 + 0.06x_2 - 0.08x_3 + 0.74x_4 &= 0.68
\end{aligned}$$

and apply the Lyusternik method to improve the roots.




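Solution. Reduce the system to a form convenient for application of the iteration method:

$$\begin{aligned}
x_1 &= 0.22x_1 + 0.02x_2 + 0.12x_3 + 0.14x_4 + 0.76, \\
x_2 &= 0.02x_1 + 0.14x_2 + 0.04x_3 - 0.06x_4 + 0.08, \\
x_3 &= 0.12x_1 + 0.04x_2 + 0.28x_3 + 0.08x_4 + 1.12, \\
x_4 &= 0.14x_1 - 0.06x_2 + 0.08x_3 + 0.26x_4 + 0.68
\end{aligned} \tag{14}$$

or, in matrix form,

$$\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} =
\begin{bmatrix} 0.76 \\ 0.08 \\ 1.12 \\ 0.68 \end{bmatrix} +
\begin{bmatrix} 0.22 & 0.02 & 0.12 & 0.14 \\ 0.02 & 0.14 & 0.04 & -0.06 \\ 0.12 & 0.04 & 0.28 & 0.08 \\ 0.14 & -0.06 & 0.08 & 0.26 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} \tag{14'}$$

whence

$$\alpha = \begin{bmatrix} 0.22 & 0.02 & 0.12 & 0.14 \\ 0.02 & 0.14 & 0.04 & -0.06 \\ 0.12 & 0.04 & 0.28 & 0.08 \\ 0.14 & -0.06 & 0.08 & 0.26 \end{bmatrix}
\quad \text{and} \quad
\beta = \begin{bmatrix} 0.76 \\ 0.08 \\ 1.12 \\ 0.68 \end{bmatrix}$$

Since

$$\|\alpha\|_m = \max\{0.50,\; 0.26,\; 0.52,\; 0.54\} = 0.54 < 1$$

the process of iteration for system (14) converges.
Using the vector $\beta$ as the initial vector $x^{(0)}$, we obtain, for the $m$th approximation $x^{(m)}$ of the desired solution, the following expression:

$$x^{(m)} = \sum_{k=0}^{m}\alpha^k\beta \tag{15}$$

Thus, in order to compute $x^{(m)}$ we have to form successive iterations of the vector $\beta$ by means of matrix $\alpha$. We have

$$\alpha\beta = \begin{bmatrix} 0.22 & 0.02 & 0.12 & 0.14 \\ 0.02 & 0.14 & 0.04 & -0.06 \\ 0.12 & 0.04 & 0.28 & 0.08 \\ 0.14 & -0.06 & 0.08 & 0.26 \end{bmatrix}
\begin{bmatrix} 0.76 \\ 0.08 \\ 1.12 \\ 0.68 \end{bmatrix} =
\begin{bmatrix} 0.3984 \\ 0.0304 \\ 0.4624 \\ 0.3680 \end{bmatrix},$$

$$\alpha^2\beta = \alpha \cdot \alpha\beta = \begin{bmatrix} 0.22 & 0.02 & 0.12 & 0.14 \\ 0.02 & 0.14 & 0.04 & -0.06 \\ 0.12 & 0.04 & 0.28 & 0.08 \\ 0.14 & -0.06 & 0.08 & 0.26 \end{bmatrix}
\begin{bmatrix} 0.3984 \\ 0.0304 \\ 0.4624 \\ 0.3680 \end{bmatrix} =
\begin{bmatrix} 0.195264 \\ 0.008640 \\ 0.207936 \\ 0.186624 \end{bmatrix}$$

and so on.
The results of the appropriate computations are listed in Table 31.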


TABLE 31
SUCCESSIVE ITERATIONS OF THE VECTOR β BY THE MATRIX α

  β       αβ       α^2 β      α^3 β        α^4 β
  0.76    0.3984   0.195264   0.09421056   0.04527913
  0.08    0.0304   0.008640   0.00223488   0.00055572
  1.12    0.4624   0.207936   0.09692928   0.04589292
  0.68    0.3680   0.186624   0.09197568   0.04472340

  α^5 β        α^6 β        α^7 β        α^8 β        x^(8) = Σ_{k=0}^{8} α^k β
  0.02174095   0.01043649   0.00500961   0.00240463   1.532746
  0.00013570   0.00003285   0.00000792   0.00000190   0.122009
  0.02188361   0.01047017   0.00501763   0.00240654   1.972937
  0.02160525   0.01040364   0.00500170   0.00240272   1.410737


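In formula (11) we take $m = 8$. Since the matrix $\alpha$ is symmetric, to compute its first eigenvalue $\lambda_1$ we use the method of scalar products. We have

$$\lambda_1 \approx \frac{\left(\alpha^8\beta,\; \alpha^8\beta\right)}{\left(\alpha^7\beta,\; \alpha^8\beta\right)} =$$

$$= \frac{240{,}463^2 + 190^2 + 240{,}654^2 + 240{,}272^2}{500{,}961 \cdot 240{,}463 + 792 \cdot 190 + 501{,}763 \cdot 240{,}654 + 500{,}170 \cdot 240{,}272} = 0.480000$$

whence, taking into account that $x^{(8)} - x^{(7)} = \alpha^8\beta$, we find

$$x \approx x^{(8)} + \frac{\lambda_1}{1 - \lambda_1}\,\alpha^8\beta =
\begin{bmatrix} 1.532746 \\ 0.122009 \\ 1.972937 \\ 1.410737 \end{bmatrix} + \frac{12}{13}
\begin{bmatrix} 0.002405 \\ 0.000002 \\ 0.002406 \\ 0.002403 \end{bmatrix} =
\begin{bmatrix} 1.534965 \\ 0.122011 \\ 1.975159 \\ 1.412955 \end{bmatrix}$$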



For comparison, we give the values of the roots of system (14) obtained by the Gaussian method [1]:

$$x_1 = 1.534965, \quad x_2 = 0.122010, \quad x_3 = 1.975166, \quad x_4 = 1.412955$$

Thus, whereas $x^{(8)}$ yielded values of the roots $x_i$ $(i = 1, 2, 3, 4)$ with a rough accuracy of $1 \cdot 10^{-3}$ to $2 \cdot 10^{-3}$, the corrections of Lyusternik yield these roots with an approximate accuracy of $10^{-6}$.
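The computation of this example can be sketched in a few lines. The code below is an illustration only: the function name `lyusternik` and its return convention are assumptions made here, the initial vector is taken as $x^{(0)} = \beta$ as in the text, and the eigenvalue is estimated by the scalar-product formula (13), which presupposes a symmetric matrix $\alpha$ and $m \geq 2$.

```python
def mat_vec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def lyusternik(alpha, beta, m):
    """Run m simple iterations from x^(0) = beta, then apply correction (11).

    Returns (x_m, lam, corrected), where lam is the scalar-product
    estimate of the dominant eigenvalue of alpha.
    """
    x_prev = list(beta)
    d_prev, d = None, None             # two latest differences x^(k) - x^(k-1)
    for _ in range(m):
        x = [b + ax for b, ax in zip(beta, mat_vec(alpha, x_prev))]
        d_prev, d = d, [xi - pi for xi, pi in zip(x, x_prev)]
        x_prev = x
    lam = dot(d, d) / dot(d_prev, d)   # formula (13)
    corrected = [xi + lam / (1.0 - lam) * di for xi, di in zip(x_prev, d)]
    return x_prev, lam, corrected


alpha = [[0.22, 0.02, 0.12, 0.14],
         [0.02, 0.14, 0.04, -0.06],
         [0.12, 0.04, 0.28, 0.08],
         [0.14, -0.06, 0.08, 0.26]]
beta = [0.76, 0.08, 1.12, 0.68]

x8, lam, x_corr = lyusternik(alpha, beta, m=8)
print(lam)
print(x8)
print(x_corr)
```

The uncorrected vector corresponds to the last column of Table 31, and the corrected one to the result of formula (11); small differences in the last digit relative to the text are to be expected, since the hand computation rounds the intermediate quantities.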

The Lyusternik method of accelerating convergence can also be applied to the Seidel process. As we know, the Seidel process for system (2) is a process of iteration for the equivalent system

$$x = \beta_1 + \alpha_1 x$$

where the matrix $\alpha_1$ is uniquely defined in terms of the matrix $\alpha$ (see Sec. 11.3); namely, if

$$\alpha = B + C$$

where $B$ is a lower triangular matrix with zero diagonal and $C$ is an upper triangular matrix, then

$$\alpha_1 = (E - B)^{-1}C$$

For this reason, if $\xi^{(m)}$ $(m = 1, 2, \ldots)$ are successive Seidel approximations of the root $x$ of the system (2), then we can put

$$x \approx \xi^{(m)} + \frac{\xi^{(m+1)} - \xi^{(m)}}{1 - \mu_1}$$

where $\mu_1$ is the numerically largest eigenvalue of the matrix $\alpha_1$.

There are also other methods for accelerating the convergence of iteration processes in the solution of systems of linear equations, such as the method of M. K. Gavurin [7], [8] and the method of A. A. Abramov [9].


REFERENCES FOR CHAPTER 12 


[1] V. N. Faddeyeva, Computational Methods of Linear Algebra, 1950, Chapter III (in Russian).

[2] I. M. Gelfand, Lectures on Linear Algebra, 1951, Appendix I (in Russian).

[3] A. G. Kurosh, Course of Higher Algebra, 1972, Chapter 6 (translated from the Russian).

[4] Harold Wayland, Expansion of Determinantal Equations into Polynomial Form, Quarterly of Applied Math., 1944, Vol. II, No. 4.

[5] W. E. Milne, Numerical Calculus, 1949, Chapter II.

[6] L. A. Lyusternik, Transactions of the Steklov Institute of Mathematics, 20 (1947) (in Russian).

[7] M. K. Gavurin, Application of Polynomials of Best Approximation to the Acceleration of Convergence of Iterative Processes, Uspekhi Matem. Nauk, 5:3 (37) (1950) (in Russian).

[8] I. S. Berezin and N. P. Zhidkov, Computational Methods, 1959, Vol. 2, Chapter VII (in Russian).

[9] D. K. Faddeyev and V. N. Faddeyeva, Computational Methods of Linear Algebra, 1960, Chapter IX (in Russian).


Chapter 13 


APPROXIMATE SOLUTION OF SYSTEMS 
OF NONLINEAR EQUATIONS 


13.1 NEWTON'S METHOD 


We consider, generally speaking, a nonlinear system of equations

$$\begin{aligned}
f_1(x_1, x_2, \ldots, x_n) &= 0, \\
f_2(x_1, x_2, \ldots, x_n) &= 0, \\
\ldots\ldots\ldots\ldots\ldots\ldots & \\
f_n(x_1, x_2, \ldots, x_n) &= 0
\end{aligned} \tag{1}$$

with real left members.
We write system (1) more compactly by regarding the set of arguments $x_1, x_2, \ldots, x_n$ as an $n$-dimensional vector:

$$x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$

Similarly, the set of functions $f_1, f_2, \ldots, f_n$ is also an $n$-dimensional vector (vector function):

$$f = \begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ f_n \end{bmatrix}$$

The system (1) can therefore be written briefly as

$$f(x) = 0 \tag{1'}$$

We solve (1') by the method of successive approximations. Suppose we have found the $p$th approximation

$$x^{(p)} = \left(x_1^{(p)}, x_2^{(p)}, \ldots, x_n^{(p)}\right)$$

of one of the isolated roots $x = (x_1, x_2, \ldots, x_n)$ of the vector equation (1'). Then the exact root of (1') can be represented as

$$x = x^{(p)} + \varepsilon^{(p)} \tag{2}$$

where $\varepsilon^{(p)} = \left(\varepsilon_1^{(p)}, \varepsilon_2^{(p)}, \ldots, \varepsilon_n^{(p)}\right)$ is the correction (error of the root).

Putting (2) into (1'), we have

$$f\left(x^{(p)} + \varepsilon^{(p)}\right) = 0 \tag{3}$$

On the assumption that the function $f(x)$ is continuously differentiable in some convex domain containing $x$ and $x^{(p)}$, we expand the left-hand member of equation (3) in powers of the small vector $\varepsilon^{(p)}$, confining ourselves to linear terms,

$$f\left(x^{(p)} + \varepsilon^{(p)}\right) = f\left(x^{(p)}\right) + f'\left(x^{(p)}\right)\varepsilon^{(p)} = 0 \tag{4}$$

or, in expanded form,

$$f_i\left(x_1^{(p)} + \varepsilon_1^{(p)}, \ldots, x_n^{(p)} + \varepsilon_n^{(p)}\right) = f_i\left(x^{(p)}\right) + \frac{\partial f_i\left(x^{(p)}\right)}{\partial x_1}\varepsilon_1^{(p)} + \frac{\partial f_i\left(x^{(p)}\right)}{\partial x_2}\varepsilon_2^{(p)} + \ldots + \frac{\partial f_i\left(x^{(p)}\right)}{\partial x_n}\varepsilon_n^{(p)} = 0 \qquad (i = 1, 2, \ldots, n) \tag{4'}$$

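From formulas (4) and (4') it follows that the derivative $f'(x)$ is to be understood as the Jacobi matrix of the set of functions $f_1, f_2, \ldots, f_n$ with respect to the variables $x_1, x_2, \ldots, x_n$; that is,

$$f'(x) = W(x) = \begin{bmatrix}
\dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \ldots & \dfrac{\partial f_1}{\partial x_n} \\
\dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \ldots & \dfrac{\partial f_2}{\partial x_n} \\
\ldots & \ldots & \ldots & \ldots \\
\dfrac{\partial f_n}{\partial x_1} & \dfrac{\partial f_n}{\partial x_2} & \ldots & \dfrac{\partial f_n}{\partial x_n}
\end{bmatrix}$$

or, briefly,

$$f'(x) = W(x) = \left[\frac{\partial f_i}{\partial x_j}\right] \qquad (i, j = 1, 2, \ldots, n)$$

The system (4') is a linear system in the corrections $\varepsilon_i^{(p)}$ $(i = 1, 2, \ldots, n)$ with matrix $W(x^{(p)})$, and so formula (4) may be written as follows:

$$f\left(x^{(p)}\right) + W\left(x^{(p)}\right)\varepsilon^{(p)} = 0$$

whence, assuming that the matrix $W(x^{(p)})$ is nonsingular, we get

$$\varepsilon^{(p)} = -W^{-1}\left(x^{(p)}\right) f\left(x^{(p)}\right)$$

Hence

$$x^{(p+1)} = x^{(p)} - W^{-1}\left(x^{(p)}\right) f\left(x^{(p)}\right) \qquad (p = 0, 1, 2, \ldots) \tag{5}$$

(Newton's method).

For the zeroth approximation $x^{(0)}$ we can take a rough value of the desired root.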


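Example 1. Approximate the positive solutions of the following system of equations (cf. Sec. 4.9):

$$\begin{aligned}
f_1(x_1, x_2) &\equiv x_1 + 3\log_{10}x_1 - x_2^2 = 0, \\
f_2(x_1, x_2) &\equiv 2x_1^2 - x_1x_2 - 5x_1 + 1 = 0
\end{aligned} \tag{6}$$

Solution. The curves defined by system (6) intersect approximately in the points $M_1(1.4, -1.5)$ and $M_2(3.4, 2.2)$. Starting with the initial approximation

$$x^{(0)} = \begin{bmatrix} 3.4 \\ 2.2 \end{bmatrix}$$

we compute the subsequent approximations of the roots, carrying the computations to four decimal places. Setting

$$f(x) = \begin{bmatrix} f_1(x_1, x_2) \\ f_2(x_1, x_2) \end{bmatrix}$$

we have

$$f\left(x^{(0)}\right) = \begin{bmatrix} 3.4 + 3\log_{10}3.4 - 2.2^2 \\ 2 \cdot 3.4^2 - 3.4 \cdot 2.2 - 5 \cdot 3.4 + 1 \end{bmatrix} = \begin{bmatrix} 0.1544 \\ -0.3600 \end{bmatrix}$$

Now form the Jacobi matrix

$$W(x) = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} \\ \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 1 + \dfrac{3M}{x_1} & -2x_2 \\ 4x_1 - x_2 - 5 & -x_1 \end{bmatrix}$$

where $M = 0.43429$, whence

$$W\left(x^{(0)}\right) = \begin{bmatrix} 1 + \dfrac{3 \cdot 0.43429}{3.4} & -2 \cdot 2.2 \\ 4 \cdot 3.4 - 2.2 - 5 & -3.4 \end{bmatrix} = \begin{bmatrix} 1.3832 & -4.4 \\ 6.4 & -3.4 \end{bmatrix}$$

and

$$\Delta = \det W\left(x^{(0)}\right) = 23.4571$$

Thus, the matrix $W(x^{(0)})$ is nonsingular. Form the inverse

$$W^{-1}\left(x^{(0)}\right) = \frac{1}{23.4571}\begin{bmatrix} -3.4 & 4.4 \\ -6.4 & 1.3832 \end{bmatrix}$$

Using formula (5), we get

$$x^{(1)} = \begin{bmatrix} 3.4 \\ 2.2 \end{bmatrix} - \frac{1}{23.4571}\begin{bmatrix} -3.4 & 4.4 \\ -6.4 & 1.3832 \end{bmatrix}\begin{bmatrix} 0.1544 \\ -0.3600 \end{bmatrix} =
\begin{bmatrix} 3.4 \\ 2.2 \end{bmatrix} - \frac{1}{23.4571}\begin{bmatrix} -2.10896 \\ -1.48611 \end{bmatrix} =
\begin{bmatrix} 3.4 \\ 2.2 \end{bmatrix} + \begin{bmatrix} 0.0899 \\ 0.0633 \end{bmatrix} = \begin{bmatrix} 3.4899 \\ 2.2633 \end{bmatrix}$$

The subsequent approximations are found analogously. The results of the computations are listed in Table 32.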


TABLE 32
SUCCESSIVE APPROXIMATIONS OF THE ROOTS OF SYSTEM (6)

  p    x_1^(p)   ε_1^(p) = Δx_1^(p)   x_2^(p)
  0    3.4        0.0899              2.2
  1    3.4899    -0.0008              2.2633
  2    3.4891    -0.0016              2.2621
  3    3.4875                         2.2616

Stopping with the approximation $x^{(3)}$, we have

$$x_1 = 3.4875, \quad x_2 = 2.2616$$


0.0002 
FE) = es 
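As a cross-check of Example 1, the Newton step (5) for a system of two equations can be coded with the explicit inverse of a second-order matrix. This is a minimal sketch under stated assumptions: the function names `f`, `jacobian` and `newton_2x2` are chosen here for illustration, and the computation is carried out in full floating-point precision, so intermediate iterates need not coincide digit for digit with the four-decimal hand computation of Table 32.

```python
import math


def f(x1, x2):
    """Left members of system (6)."""
    return (x1 + 3.0 * math.log10(x1) - x2 ** 2,
            2.0 * x1 ** 2 - x1 * x2 - 5.0 * x1 + 1.0)


def jacobian(x1, x2):
    """Jacobi matrix W(x) of system (6); M = log10(e) = 0.43429..."""
    m = math.log10(math.e)
    return ((1.0 + 3.0 * m / x1, -2.0 * x2),
            (4.0 * x1 - x2 - 5.0, -x1))


def newton_2x2(x1, x2, steps=20, tol=1e-12):
    """Newton's method (5); returns the list of successive approximations."""
    history = [(x1, x2)]
    for _ in range(steps):
        f1, f2 = f(x1, x2)
        (a, b), (c, d) = jacobian(x1, x2)
        det = a * d - b * c
        # correction e = -W^{-1} f, with W^{-1} = (1/det) [[d, -b], [-c, a]]
        e1 = -(d * f1 - b * f2) / det
        e2 = -(-c * f1 + a * f2) / det
        x1, x2 = x1 + e1, x2 + e2
        history.append((x1, x2))
        if max(abs(e1), abs(e2)) < tol:
            break
    return history


for p, (x1, x2) in enumerate(newton_2x2(3.4, 2.2)):
    print(p, round(x1, 4), round(x2, 4))
```

Starting from the other intersection point, for instance `newton_2x2(1.4, -1.5)`, gives the second solution of the system in the same way.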


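Example 2. Use the Newton method to approximate the positive solution of the system of equations

$$\begin{aligned}
x^2 + y^2 + z^2 &= 1, \\
2x^2 + y^2 - 4z &= 0, \\
3x^2 - 4y + z^2 &= 0
\end{aligned}$$

starting with the initial approximation

$$x_0 = y_0 = z_0 = 0.5$$

Solution. We have

$$f(x) = \begin{bmatrix} x^2 + y^2 + z^2 - 1 \\ 2x^2 + y^2 - 4z \\ 3x^2 - 4y + z^2 \end{bmatrix}$$

whence

$$f\left(x^{(0)}\right) = \begin{bmatrix} 0.25 + 0.25 + 0.25 - 1 \\ 0.50 + 0.25 - 2.00 \\ 0.75 - 2.00 + 0.25 \end{bmatrix} = \begin{bmatrix} -0.25 \\ -1.25 \\ -1.00 \end{bmatrix}$$

Form the Jacobi matrix

$$W(x) = \begin{bmatrix} 2x & 2y & 2z \\ 4x & 2y & -4 \\ 6x & -4 & 2z \end{bmatrix}$$

We have

$$W\left(x^{(0)}\right) = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 1 & -4 \\ 3 & -4 & 1 \end{bmatrix}$$

and

$$\det W\left(x^{(0)}\right) = \begin{vmatrix} 1 & 1 & 1 \\ 2 & 1 & -4 \\ 3 & -4 & 1 \end{vmatrix} = -40$$

The inverse matrix is

$$W^{-1}\left(x^{(0)}\right) = -\frac{1}{40}\begin{bmatrix} -15 & -5 & -5 \\ -14 & -2 & 6 \\ -11 & 7 & -1 \end{bmatrix} =
\begin{bmatrix} \frac{3}{8} & \frac{1}{8} & \frac{1}{8} \\[2pt] \frac{7}{20} & \frac{1}{20} & -\frac{3}{20} \\[2pt] \frac{11}{40} & -\frac{7}{40} & \frac{1}{40} \end{bmatrix}$$

Using formula (5), we obtain the first approximation:

$$x^{(1)} = x^{(0)} - W^{-1}\left(x^{(0)}\right) f\left(x^{(0)}\right) =
\begin{bmatrix} 0.5 \\ 0.5 \\ 0.5 \end{bmatrix} -
\begin{bmatrix} \frac{3}{8} & \frac{1}{8} & \frac{1}{8} \\[2pt] \frac{7}{20} & \frac{1}{20} & -\frac{3}{20} \\[2pt] \frac{11}{40} & -\frac{7}{40} & \frac{1}{40} \end{bmatrix}
\begin{bmatrix} -0.25 \\ -1.25 \\ -1.00 \end{bmatrix} =
\begin{bmatrix} 0.5 \\ 0.5 \\ 0.5 \end{bmatrix} + \begin{bmatrix} 0.375 \\ 0 \\ -0.125 \end{bmatrix} =
\begin{bmatrix} 0.875 \\ 0.500 \\ 0.375 \end{bmatrix}$$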



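Then we compute the second approximation $x^{(2)}$ to get

$$f\left(x^{(1)}\right) = \begin{bmatrix} 0.875^2 + 0.500^2 + 0.375^2 - 1 \\ 2 \cdot 0.875^2 + 0.500^2 - 4 \cdot 0.375 \\ 3 \cdot 0.875^2 - 4 \cdot 0.500 + 0.375^2 \end{bmatrix} = \begin{bmatrix} 0.15625 \\ 0.28125 \\ 0.43750 \end{bmatrix}$$

and

$$W\left(x^{(1)}\right) = \begin{bmatrix} 2 \cdot 0.875 & 2 \cdot 0.500 & 2 \cdot 0.375 \\ 4 \cdot 0.875 & 2 \cdot 0.500 & -4 \\ 6 \cdot 0.875 & -4 & 2 \cdot 0.375 \end{bmatrix} =
\begin{bmatrix} 1.750 & 1 & 0.750 \\ 3.500 & 1 & -4 \\ 5.250 & -4 & 0.750 \end{bmatrix}$$

whence

$$\det W\left(x^{(1)}\right) = \begin{vmatrix} 1.750 & 1 & 0.750 \\ 3.500 & 1 & -4 \\ 5.250 & -4 & 0.750 \end{vmatrix} =
\begin{vmatrix} 1.750 & 1 & 0.750 \\ 1.750 & 0 & -4.750 \\ 12.250 & 0 & 3.750 \end{vmatrix} = -64.75$$

and

$$W^{-1}\left(x^{(1)}\right) = -\frac{1}{64.75}\begin{bmatrix} -15.25 & -3.75 & -4.75 \\ -23.625 & -2.625 & 9.625 \\ -19.25 & 12.25 & -1.75 \end{bmatrix}$$

Using formula (5), we obtain

$$x^{(2)} = x^{(1)} - W^{-1}\left(x^{(1)}\right) f\left(x^{(1)}\right) =
\begin{bmatrix} 0.875 \\ 0.500 \\ 0.375 \end{bmatrix} + \frac{1}{64.75}
\begin{bmatrix} -15.25 & -3.75 & -4.75 \\ -23.625 & -2.625 & 9.625 \\ -19.25 & 12.25 & -1.75 \end{bmatrix}
\begin{bmatrix} 0.15625 \\ 0.28125 \\ 0.43750 \end{bmatrix} =$$

$$= \begin{bmatrix} 0.875 \\ 0.500 \\ 0.375 \end{bmatrix} - \begin{bmatrix} 0.08519 \\ 0.00338 \\ 0.00507 \end{bmatrix} = \begin{bmatrix} 0.78981 \\ 0.49662 \\ 0.36993 \end{bmatrix}$$

The subsequent approximations are found similarly:

$$x^{(3)} = \begin{bmatrix} 0.78521 \\ 0.49662 \\ 0.36992 \end{bmatrix}, \qquad f\left(x^{(3)}\right) = \begin{bmatrix} 0.00001 \\ 0.00004 \\ 0.00005 \end{bmatrix}$$

and so forth.

Stopping with the third approximation, we get

$$x = 0.7852, \quad y = 0.4966, \quad z = 0.3699$$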
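For systems of higher order it is usually more convenient to solve the linear system $W(x^{(p)})\varepsilon^{(p)} = -f(x^{(p)})$ at each step than to form the inverse matrix. The sketch below follows that route for Example 2; it is an illustration under assumptions, with the names `solve_linear` and `newton_system` introduced here and Gaussian elimination with partial pivoting substituted for the explicit inverse used in the text.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [bi] for row, bi in zip(a, b)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def newton_system(f, jac, x0, tol=1e-12, max_iter=50):
    """Newton's method (5); returns the list of successive approximations."""
    x = list(x0)
    history = [x]
    for _ in range(max_iter):
        # correction e solves W(x) e = -f(x)
        e = solve_linear(jac(x), [-v for v in f(x)])
        x = [xi + ei for xi, ei in zip(x, e)]
        history.append(x)
        if max(abs(ei) for ei in e) < tol:
            break
    return history


def f(v):
    x, y, z = v
    return [x * x + y * y + z * z - 1.0,
            2.0 * x * x + y * y - 4.0 * z,
            3.0 * x * x - 4.0 * y + z * z]


def jac(v):
    x, y, z = v
    return [[2.0 * x, 2.0 * y, 2.0 * z],
            [4.0 * x, 2.0 * y, -4.0],
            [6.0 * x, -4.0, 2.0 * z]]


for p, approx in enumerate(newton_system(f, jac, [0.5, 0.5, 0.5])):
    print(p, [round(c, 5) for c in approx])
```

The first two printed approximations can be compared with $x^{(1)}$ and $x^{(2)}$ obtained above by hand.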


13.2 GENERAL REMARKS ON THE CONVERGENCE 
OF THE NEWTON PROCESS 


In Sec. 13.1 we presented a formal aspect of the Newton method. The conditions of convergence of this method for a system have been investigated by Willers, Stenin, Ostrowski, Kantorovich, and others. Below we give a special case of the Kantorovich theorem (Theorem 1) [1] on the convergence of the Newton process in functional spaces as applied to finite systems of nonlinear equations; for the sake of simplicity we use rough estimates. Following L. V. Kantorovich, we also establish the rapidity of convergence of the Newton process, the uniqueness of the root of the system and the stability of the process with respect to choice of the initial approximation (Theorems 2 to 4). As a particular case, we obtain the Ostrowski theorem [2] on the convergence of the Newton process for an equation with an analytic complex right-hand member.

In the sequel it will be convenient to regard the sets of functions as vector functions or matrix functions. To simplify the presentation, we will generalize the concept of a derivative to these cases.


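Let $x = (x_1, \ldots, x_n)$ and

$$f(x) = \begin{bmatrix} f_1(x) \\ \vdots \\ f_n(x) \end{bmatrix}$$

where $f_i \in C^{(1)}$ $(i = 1, 2, \ldots, n)$.

Definition 1. The derivative $f'(x)$ is understood to mean the Jacobi matrix of the set of functions $f_i$ $(i = 1, \ldots, n)$ with respect to the variables $x_1, \ldots, x_n$, that is,

$$f'(x) = \left[\frac{\partial f_i}{\partial x_j}\right] \tag{1}$$

The matrix function

$$F(x) = \begin{bmatrix} f_{11}(x) & \ldots & f_{1r}(x) \\ \ldots & \ldots & \ldots \\ f_{n1}(x) & \ldots & f_{nr}(x) \end{bmatrix}$$

may be regarded as a set of $r$ vector functions

$$F_1(x) = \begin{bmatrix} f_{11}(x) \\ \vdots \\ f_{n1}(x) \end{bmatrix}, \quad \ldots, \quad F_r(x) = \begin{bmatrix} f_{1r}(x) \\ \vdots \\ f_{nr}(x) \end{bmatrix}$$

Therefore, it is natural to take the derivative $F'(x)$ as meaning the set

$$F'(x) = \left[F_1'(x) \; \ldots \; F_r'(x)\right]$$

where

$$F_k'(x) = \begin{bmatrix} \dfrac{\partial f_{1k}}{\partial x_1} & \ldots & \dfrac{\partial f_{1k}}{\partial x_n} \\ \ldots & \ldots & \ldots \\ \dfrac{\partial f_{nk}}{\partial x_1} & \ldots & \dfrac{\partial f_{nk}}{\partial x_n} \end{bmatrix}$$

are Jacobi matrices $(k = 1, 2, \ldots, r)$.

Definition 2. If $F(x) = [f_{ij}(x)]$ is a functional matrix of dimensions $n \times r$ and $f_{ij}(x) \in C^{(1)}$, then

$$F'(x) = \left[F_k'(x)\right] \tag{2}$$

where

$$F_k'(x) = \left[\frac{\partial f_{ik}}{\partial x_j}\right] \qquad (i, j = 1, 2, \ldots, n; \; k = 1, 2, \ldots, r)$$

In particular, if the vector function $f(x) = [f_i(x)]$ is such that $f_i(x) \in C^{(2)}$, then

$$f''(x) = \left[W_1'(x) \; \ldots \; W_n'(x)\right]$$

where

$$W_j'(x) = \left[\frac{\partial^2 f_i}{\partial x_j\,\partial x_k}\right] \qquad (i, k = 1, 2, \ldots, n)$$

In this section we use the $m$-norm (Sec. 7.7) for estimating matrices; the subscript $m$ will be omitted for brevity:

$$\|f(x)\| = \max_i |f_i(x)|,$$

$$\|f'(x)\| = \max_i \sum_{j=1}^{n}\left|\frac{\partial f_i(x)}{\partial x_j}\right|,$$

$$\|f''(x)\| = \max_j \|W_j'(x)\| = \max_j\left(\max_i \sum_{k=1}^{n}\left|\frac{\partial^2 f_i(x)}{\partial x_j\,\partial x_k}\right|\right) = \max_{i,j}\sum_{k=1}^{n}\left|\frac{\partial^2 f_i(x)}{\partial x_j\,\partial x_k}\right|, \text{ etc.}^{1)}$$

Similarly

$$\|F(x)\| = \max_i \sum_{j=1}^{r}|f_{ij}(x)|,$$

$$\|F'(x)\| = \max_{i,j}\sum_{k=1}^{n}\left|\frac{\partial f_{ij}(x)}{\partial x_k}\right|$$

First, we will derive several estimates, similar to the mean-value theorem, for the $m$-norms of the differences of values of matrix functions, which will be useful in the sequel (cf. [1]).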


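Lemma 1. If

$$F(x) = [f_{ij}(x)] \qquad (n \times r)$$

where $f_{ij}(x)$ are continuous, together with their first-order partial derivatives, in a convex domain containing the points $x$ and $x + \Delta x$, then

$$\|F(x + \Delta x) - F(x)\| \leq r\,\|\Delta x\| \cdot \|F'(\xi)\| \tag{3}$$

where $\xi = x + \theta\Delta x$, $0 < \theta < 1$ and the matrix norm is to be understood in the sense of the $m$-norm.

Proof. Using the Taylor formula, we obtain

$$F(x + \Delta x) - F(x) = \left[f_{ij}(x + \Delta x) - f_{ij}(x)\right] = \left[\sum_{k=1}^{n}\frac{\partial f_{ij}(\xi_{ij})}{\partial x_k}\Delta x_k\right]$$

where $\xi_{ij} = x + \theta_{ij}\Delta x$, $0 < \theta_{ij} < 1$; $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, r$.
Whence, fixing $x$ and $\Delta x$, we get

$$\|F(x + \Delta x) - F(x)\| = \max_i \sum_{j=1}^{r}\left|\sum_{k=1}^{n}\frac{\partial f_{ij}(\xi_{ij})}{\partial x_k}\Delta x_k\right| \leq
\max_i \sum_{j=1}^{r}\sum_{k=1}^{n}\left|\frac{\partial f_{ij}(\xi_{ij})}{\partial x_k}\right| |\Delta x_k| \leq$$

$$\leq \|\Delta x\| \max_i \sum_{j=1}^{r}\sum_{k=1}^{n}\left|\frac{\partial f_{ij}(\xi_{ij})}{\partial x_k}\right| \leq
r\,\|\Delta x\| \max_{i,j}\sum_{k=1}^{n}\left|\frac{\partial f_{ij}(\xi_{ij})}{\partial x_k}\right|$$

Since the number of pairs $(i, j)$ is finite, there is a pair $(p, q)$ such that

$$\max_{i,j}\sum_{k=1}^{n}\left|\frac{\partial f_{ij}(\xi_{ij})}{\partial x_k}\right| = \sum_{k=1}^{n}\left|\frac{\partial f_{pq}(\xi_{pq})}{\partial x_k}\right| \leq \max_{i,j}\sum_{k=1}^{n}\left|\frac{\partial f_{ij}(\xi_{pq})}{\partial x_k}\right| = \|F'(\xi)\|$$

where $\xi = \xi_{pq}$.
Thus

$$\|F(x + \Delta x) - F(x)\| \leq r\,\|\Delta x\| \cdot \|F'(\xi)\|$$

which completes the proof.

1) Since, obviously, for any finite set of numbers $\{a_{ij}\}$ we have

$$\max_i\left(\max_j a_{ij}\right) = \max_{i,j} a_{ij}$$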


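Corollary 1. If

$$f(x) = \begin{bmatrix} f_1(x) \\ \vdots \\ f_n(x) \end{bmatrix} \in C^{(1)}$$

then

$$\|f(x + \Delta x) - f(x)\| \leq \|\Delta x\| \cdot \|f'(\xi)\|$$

where $\xi = x + \theta\Delta x$ and $0 < \theta < 1$.
Here $r = 1$.

Corollary 2. For $f(x) \in C^{(2)}$ we have

$$\|f'(x + \Delta x) - f'(x)\| \leq n\,\|\Delta x\| \cdot \|f''(\xi)\|$$

where $\xi = x + \theta\Delta x$ and $0 < \theta < 1$.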


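Lemma 2. If

$$f(x) = \begin{bmatrix} f_1(x) \\ \vdots \\ f_n(x) \end{bmatrix} \in C^{(2)}$$

in a convex domain containing the points $x$ and $x + \Delta x$, then

$$\|f(x + \Delta x) - f(x) - f'(x)\Delta x\| \leq \frac{1}{2}\,n\,\|\Delta x\|^2\,\|f''(\xi)\| \tag{4}$$

where $\xi = x + \theta\Delta x$ and $0 < \theta < 1$.
Proof. Using the two-term Taylor formula, we obtain

$$\|f(x + \Delta x) - f(x) - f'(x)\Delta x\| = \left\|\left[f_i(x + \Delta x) - f_i(x) - \sum_{j=1}^{n}\frac{\partial f_i(x)}{\partial x_j}\Delta x_j\right]\right\| =$$

$$= \left\|\left[\frac{1}{2}\sum_{j=1}^{n}\sum_{k=1}^{n}\frac{\partial^2 f_i(\xi_i)}{\partial x_j\,\partial x_k}\Delta x_j\Delta x_k\right]\right\| \leq
\frac{1}{2}\max_i\sum_{j=1}^{n}\sum_{k=1}^{n}\left|\frac{\partial^2 f_i(\xi_i)}{\partial x_j\,\partial x_k}\right||\Delta x_j||\Delta x_k| \leq$$

$$\leq \frac{1}{2}\|\Delta x\|^2\max_i\sum_{j=1}^{n}\sum_{k=1}^{n}\left|\frac{\partial^2 f_i(\xi_i)}{\partial x_j\,\partial x_k}\right| \tag{5}$$

where $\xi_i = x + \theta_i\Delta x$, $0 < \theta_i < 1$.
Since

$$\max_i\sum_{j=1}^{n}\sum_{k=1}^{n}\left|\frac{\partial^2 f_i(\xi_i)}{\partial x_j\,\partial x_k}\right| \leq n\max_{i,j}\sum_{k=1}^{n}\left|\frac{\partial^2 f_i(\xi_i)}{\partial x_j\,\partial x_k}\right| = n\sum_{k=1}^{n}\left|\frac{\partial^2 f_p(\xi_p)}{\partial x_q\,\partial x_k}\right| \leq n\max_{i,j}\sum_{k=1}^{n}\left|\frac{\partial^2 f_i(\xi_p)}{\partial x_j\,\partial x_k}\right| = n\,\|f''(\xi_p)\|$$

then from inequality (5) we get (taking into consideration the meaning of the norm)

$$\|f(x + \Delta x) - f(x) - f'(x)\Delta x\| \leq \frac{1}{2}\|\Delta x\|^2 \cdot n\,\|f''(\xi)\| = \frac{1}{2}\,n\,\|\Delta x\|^2\,\|f''(\xi)\|$$

where $\xi = \xi_p = x + \theta\Delta x$ and $0 < \theta < 1$.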


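*13.3 THE EXISTENCE OF ROOTS OF A SYSTEM
AND THE CONVERGENCE OF THE NEWTON PROCESS


Theorem 1. Given a nonlinear system of algebraic or transcendental equations with real coefficients:

$$f(x) = 0 \tag{1}$$

where the vector function

$$f(x) = \begin{bmatrix} f_1(x_1, \ldots, x_n) \\ \ldots\ldots\ldots\ldots \\ f_n(x_1, \ldots, x_n) \end{bmatrix}$$

is defined and continuous, together with its partial derivatives of first and second order, in a domain $\omega$, that is

$$f(x) \in C^{(2)}(\omega)$$

Suppose $x^{(0)}$ is a point lying in $\omega$ together with its closed $H$-neighbourhood:

$$\bar U_H\left(x^{(0)}\right) = \left\{\|x - x^{(0)}\| \leq H\right\} \subset \omega$$

where the norm is to be understood as the $m$-norm$^{1)}$ (see Sec. 7.7), the following conditions being valid:

(1) the Jacobi matrix $W(x) = \left[\dfrac{\partial f_i}{\partial x_j}\right]$ has the inverse $\Gamma_0 = W^{-1}(x^{(0)})$ for $x = x^{(0)}$, where

$$\|\Gamma_0\| \leq A_0 \; ^{2)}$$

(2) $\left\|\Gamma_0 f\left(x^{(0)}\right)\right\| \leq B_0 \leq \dfrac{H}{2}$,

(3) $\displaystyle\sum_{k=1}^{n}\left|\frac{\partial^2 f_i(x)}{\partial x_j\,\partial x_k}\right| \leq C$ when $i, j = 1, 2, \ldots, n$ and $x \in \bar U_H(x^{(0)})$,

(4) the constants $A_0$, $B_0$ and $C$ satisfy the inequality

$$\mu_0 = 2nA_0B_0C \leq 1 \tag{2}$$

Then the Newton process

$$x^{(p+1)} = x^{(p)} - W^{-1}\left(x^{(p)}\right) f\left(x^{(p)}\right) \tag{3}$$

$(p = 0, 1, 2, \ldots)$ converges for the initial approximation $x^{(0)}$ and the limiting vector

$$x^* = \lim_{p \to \infty} x^{(p)}$$

is a solution of system (1) such that

$$\|x^* - x^{(0)}\| \leq 2B_0 \leq H$$

Proof. We introduce the notation

$$h_p = \|x^{(p+1)} - x^{(p)}\| = \max_i\left|x_i^{(p+1)} - x_i^{(p)}\right|, \qquad \Gamma_p = W^{-1}\left(x^{(p)}\right) \qquad (p = 0, 1, 2, \ldots)$$

From formula (3) we have

$$h_p = \left\|\Gamma_p f\left(x^{(p)}\right)\right\|$$

Proceeding from the conditions (1) to (4), we obtain estimates for the quantities $\Gamma_p$ and $\Gamma_p f(x^{(p)})$.

1) That is, if $A = [a_{ij}]$, then

$$\|A\| = \|A\|_m = \max_i \sum_j |a_{ij}|$$

2) In other words, if $W(x^{(0)}) = [a_{ij}]$, then $\Gamma_0 = W^{-1}(x^{(0)}) = \left[\dfrac{A_{ji}}{\Delta}\right]$, where $A_{ij}$ are the cofactors of the elements $a_{ij}$, $\Delta = \det[a_{ij}] \neq 0$ and, consequently,

$$\|\Gamma_0\| = \frac{1}{|\Delta|}\max_i \sum_j |A_{ji}|$$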



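First consider the case $p = 1$. Using Condition (2), we have

$$h_0 = \|x^{(1)} - x^{(0)}\| = \left\|W^{-1}\left(x^{(0)}\right) f\left(x^{(0)}\right)\right\| \leq B_0 \leq \frac{H}{2}$$

and so

$$x^{(1)} \in \bar U_{H/2}\left(x^{(0)}\right)$$

and

$$\bar U_{H/2}\left(x^{(1)}\right) \subset \bar U_H\left(x^{(0)}\right)$$

To estimate $\Gamma_1 = W^{-1}(x^{(1)})$, take advantage of the relation $(AB)^{-1} = B^{-1}A^{-1}$ and represent this quantity as

$$\Gamma_1 = \left[W\left(x^{(0)}\right)\Gamma_0 W\left(x^{(1)}\right)\right]^{-1} = \left[\Gamma_0 W\left(x^{(1)}\right)\right]^{-1}\Gamma_0 \tag{4}$$

Taking into account Condition (1) of the theorem, we have

$$\left\|E - \Gamma_0 W\left(x^{(1)}\right)\right\| = \left\|\Gamma_0\left[W\left(x^{(0)}\right) - W\left(x^{(1)}\right)\right]\right\| \leq
\|\Gamma_0\| \cdot \left\|W\left(x^{(0)}\right) - W\left(x^{(1)}\right)\right\| \leq A_0\left\|W\left(x^{(1)}\right) - W\left(x^{(0)}\right)\right\|$$

Since from Condition (3) follows

$$\|f''(x)\| = \max_{i,j}\sum_{k=1}^{n}\left|\frac{\partial^2 f_i(x)}{\partial x_j\,\partial x_k}\right| \leq C$$

then by virtue of Corollary 2 of Lemma 1 we have

$$\left\|W\left(x^{(1)}\right) - W\left(x^{(0)}\right)\right\| = \left\|f'\left(x^{(1)}\right) - f'\left(x^{(0)}\right)\right\| \leq n\,\|x^{(1)} - x^{(0)}\|\,C \leq nB_0C$$

and so

$$\left\|E - \Gamma_0 W\left(x^{(1)}\right)\right\| \leq nA_0B_0C = \frac{\mu_0}{2} \leq \frac{1}{2}$$

Consequently (Sec. 7.10, Theorem 5, Corollary), there exists the inverse matrix

$$\left[\Gamma_0 W\left(x^{(1)}\right)\right]^{-1} = \left\{E - \left(E - \Gamma_0 W\left(x^{(1)}\right)\right)\right\}^{-1}$$

And since $\|E\| = \|E\|_m = 1$, it follows that

$$\left\|\left[\Gamma_0 W\left(x^{(1)}\right)\right]^{-1}\right\| \leq \frac{1}{1 - \frac{1}{2}} = 2 \tag{5}$$

We now derive from formula (4) that

$$\|\Gamma_1\| \leq \left\|\left[\Gamma_0 W\left(x^{(1)}\right)\right]^{-1}\right\| \cdot \|\Gamma_0\| \leq 2A_0 = A_1 \tag{6}$$

Formula (3) implies

$$f\left(x^{(0)}\right) + f'\left(x^{(0)}\right)\left(x^{(1)} - x^{(0)}\right) = 0$$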



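whence, on the basis of Lemma 2, we have

$$\left\|f\left(x^{(1)}\right)\right\| = \left\|f\left(x^{(1)}\right) - f\left(x^{(0)}\right) - f'\left(x^{(0)}\right)\left(x^{(1)} - x^{(0)}\right)\right\| \leq
\frac{1}{2}\,n\,\|x^{(1)} - x^{(0)}\|^2\,\|f''(\xi)\| \leq \frac{1}{2}\,nB_0^2C$$

where

$$\xi = x^{(0)} + \theta\left(x^{(1)} - x^{(0)}\right) \quad \text{and} \quad 0 < \theta < 1$$

Therefore, taking into account inequality (6), we obtain

$$\left\|\Gamma_1 f\left(x^{(1)}\right)\right\| \leq \|\Gamma_1\| \cdot \left\|f\left(x^{(1)}\right)\right\| \leq 2A_0 \cdot \frac{1}{2}\,nB_0^2C = nA_0B_0^2C = \frac{1}{2}\mu_0B_0 = B_1 \tag{7}$$

Thus, for point $x^{(1)}$ we have

$$\bar U_{H/2}\left(x^{(1)}\right) \subset \bar U_H\left(x^{(0)}\right) \subset \omega$$

and, besides,

$$\|\Gamma_1\| \leq A_1, \qquad h_1 = \left\|\Gamma_1 f\left(x^{(1)}\right)\right\| \leq B_1$$

where

$$A_1 = 2A_0, \qquad B_1 = \frac{1}{2}\mu_0B_0 \leq \frac{H}{4}$$

whence we obtain

$$\mu_1 = 2nA_1B_1C = 2n \cdot 2A_0 \cdot \frac{1}{2}\mu_0B_0 \cdot C = \mu_0 \cdot 2nA_0B_0C = \mu_0^2 \leq 1 \tag{8}$$

Thus we are again in the conditions of the theorem with the sole difference that instead of the neighbourhood $\bar U_H(x^{(0)})$ we have the neighbourhood $\bar U_{H/2}(x^{(1)})$ imbedded in the former.

Repeating similar arguments, we establish that the successive approximations $x^{(p)}$ $(p = 1, 2, \ldots)$ are meaningful and such that

$$\bar U_H\left(x^{(0)}\right) \supset \bar U_{H/2}\left(x^{(1)}\right) \supset \ldots \supset \bar U_{H/2^p}\left(x^{(p)}\right) \supset \ldots$$

Also

$$\|\Gamma_p\| = \left\|W^{-1}\left(x^{(p)}\right)\right\| \leq A_p, \qquad \left\|\Gamma_p f\left(x^{(p)}\right)\right\| = \|x^{(p+1)} - x^{(p)}\| \leq B_p$$

where the constants $A_p$ and $B_p$ are connected by the recurrence relations

$$A_p = 2A_{p-1}, \qquad B_p = \frac{1}{2}\mu_{p-1}B_{p-1} \tag{9}$$

and

$$\mu_p = 2nA_pB_pC \qquad (p = 1, 2, \ldots) \tag{10}$$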

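We will show that the Cauchy test (Sec. 7.9) is valid for the sequence of approximations $x^{(p)}$ $(p = 0, 1, 2, \ldots)$. Indeed, for $q > 0$ we have

$$x^{(p+q)} \in \bar U_{H/2^p}\left(x^{(p)}\right)$$

and so

$$\|x^{(p+q)} - x^{(p)}\| \leq \frac{H}{2^p} < \varepsilon$$

if $p > N$ and $q > 0$, which is equivalent to the Cauchy test. From this it follows that the limit

$$\lim_{p \to \infty} x^{(p)} = x^* \in \bar U_H\left(x^{(0)}\right)$$

exists.

Now let us assure ourselves that $x^*$ is a solution of system (1). From the relation (3) we have

$$f\left(x^{(p)}\right) + W\left(x^{(p)}\right)\left(x^{(p+1)} - x^{(p)}\right) = 0$$

Passing to the limit in this equation as $p \to \infty$ and noting that, in the process,

$$x^{(p+1)} - x^{(p)} \to 0$$

and also that $W(x^{(p)})$ is continuous and bounded in $\bar U_H(x^{(0)})$, we have

$$\lim_{p \to \infty} f\left(x^{(p)}\right) = 0$$

Whence, by virtue of the continuity of the function $f(x)$, we obtain

$$f\left(\lim_{p \to \infty} x^{(p)}\right) = f(x^*) = 0$$

That is, $x^*$ is a solution of system (1). Besides,

$$\|x^* - x^{(0)}\| = \left\|\sum_{p=0}^{\infty}\left[x^{(p+1)} - x^{(p)}\right]\right\| \leq \sum_{p=0}^{\infty}\|x^{(p+1)} - x^{(p)}\| \leq \sum_{p=0}^{\infty}B_p \leq B_0 + \frac{1}{2}B_0 + \frac{1}{4}B_0 + \ldots = 2B_0 \leq H$$

The proof of the theorem is complete.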
Note 1. If $f(x) \in C^{(2)}(\omega)$ and system (1) has, in the domain $\omega$, a simple solution $x^*$, that is, such that

$$f(x^*) = 0, \qquad \det f'(x^*) = \det W(x^*) \neq 0$$

then the conditions of Theorem 1 are clearly valid for every point $x^{(0)}$ sufficiently close to $x^*$.

To verify Condition (2) it is useful to note that $B_0$ yields an estimate of the divergence between the initial and the first approximation of the Newton process:

$$\left\|\Gamma_0 f\left(x^{(0)}\right)\right\| = \|x^{(1)} - x^{(0)}\| \leq B_0$$

and so this inequality can readily be verified as soon as the approximation $x^{(1)}$ is found.

Note 2. Analogous statements are obtained for the theorem of convergence if we use the norm $\|A\|_l$ or $\|A\|_k$ in place of the norm $\|A\|_m$.
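The constants of Theorem 1 are easy to evaluate numerically for a concrete system. The sketch below does this for the system of Example 2 of Sec. 13.1; it is an illustration under stated assumptions. The names `inverse_3x3`, `m_norm` and `kantorovich_constants` are hypothetical, the inverse is formed from cofactors as in the footnote to Condition (1), and the bound $C = 6$ is taken from the fact that the second derivatives of that system are constants, the largest row sum in Condition (3) being $|\partial^2 f_3/\partial x^2| = 6$. The quantity returned is only $\mu_0$; the requirement $B_0 \leq H/2$ still has to be checked against the domain $\omega$ separately.

```python
def f(v):
    x, y, z = v
    return [x * x + y * y + z * z - 1.0,
            2.0 * x * x + y * y - 4.0 * z,
            3.0 * x * x - 4.0 * y + z * z]


def jac(v):
    x, y, z = v
    return [[2.0 * x, 2.0 * y, 2.0 * z],
            [4.0 * x, 2.0 * y, -4.0],
            [6.0 * x, -4.0, 2.0 * z]]


def inverse_3x3(w):
    """Inverse of a 3x3 matrix via cofactors (adjugate divided by determinant)."""
    (a, b, c), (d, e, f_), (g, h, i) = w
    det = a * (e * i - f_ * h) - b * (d * i - f_ * g) + c * (d * h - e * g)
    adj = [[e * i - f_ * h, c * h - b * i, b * f_ - c * e],
           [f_ * g - d * i, a * i - c * g, c * d - a * f_],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]


def m_norm(a):
    """m-norm of a matrix: the largest row sum of absolute values."""
    return max(sum(abs(v) for v in row) for row in a)


def kantorovich_constants(x, c_bound=6.0):
    """Return (A0, B0, mu0) of Theorem 1 at the point x."""
    gamma = inverse_3x3(jac(x))
    fx = f(x)
    a0 = m_norm(gamma)
    b0 = max(abs(sum(g * v for g, v in zip(row, fx))) for row in gamma)
    n = len(x)
    return a0, b0, 2.0 * n * a0 * b0 * c_bound


print(kantorovich_constants([0.5, 0.5, 0.5]))
print(kantorovich_constants([0.78981, 0.49662, 0.36993]))
```

With these rough estimates the inequality $\mu_0 \leq 1$ is not satisfied at the starting point $(0.5, 0.5, 0.5)$ of Example 2, although the process converges from there; it is satisfied at the second approximation. This illustrates that the conditions of the theorem are sufficient, not necessary.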


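*13.4 THE RAPIDITY OF CONVERGENCE
OF THE NEWTON PROCESS


Theorem 2. If Conditions (1) to (4) of Theorem 1 (Sec. 13.3) are fulfilled, then the following inequality holds true for the successive approximations $x^{(p)}$ $(p = 0, 1, 2, \ldots)$:

$$\|x^* - x^{(p)}\| \leq \left(\frac{1}{2}\right)^{p-1}\mu_0^{2^p - 1}B_0 \tag{1}$$

where $x^*$ is a solution of the system and $\mu_0$ is found from formula (2) of Sec. 13.3.
Proof. Using the relations (9) and (10) of Sec. 13.3, we have

$$\mu_p = 2nA_pB_pC = 2n \cdot 2A_{p-1} \cdot \frac{1}{2}\mu_{p-1}B_{p-1} \cdot C = \mu_{p-1} \cdot 2nA_{p-1}B_{p-1}C = \mu_{p-1}^2$$

From this we obtain

$$\mu_1 = \mu_0^2, \quad \mu_2 = \mu_1^2 = \mu_0^4, \quad \ldots, \quad \mu_p = \mu_0^{2^p}$$

Furthermore,

$$B_p = \frac{1}{2}\mu_{p-1}B_{p-1} = \frac{1}{2}\mu_0^{2^{p-1}}B_{p-1}$$

and so

$$B_p = \frac{1}{2}\mu_0^{2^{p-1}} \cdot \frac{1}{2}\mu_0^{2^{p-2}} \cdots \frac{1}{2}\mu_0 \cdot B_0 = \left(\frac{1}{2}\right)^p \mu_0^{2^{p-1} + 2^{p-2} + \ldots + 1}B_0 = \left(\frac{1}{2}\right)^p\mu_0^{2^p - 1}B_0 \tag{2}$$

Since

$$\|x^{(p+1)} - x^{(p)}\| \leq B_p$$

we have, for $q \geq 1$,

$$\|x^{(p+q)} - x^{(p)}\| \leq \|x^{(p+1)} - x^{(p)}\| + \|x^{(p+2)} - x^{(p+1)}\| + \ldots + \|x^{(p+q)} - x^{(p+q-1)}\| \leq B_p + B_{p+1} + \ldots + B_{p+q-1} =$$

$$= \left(\frac{1}{2}\right)^p\mu_0^{2^p - 1}B_0\left[1 + \frac{1}{2}\mu_0^{2^p} + \ldots + \left(\frac{1}{2}\right)^{q-1}\mu_0^{2^{p+q-1} - 2^p}\right]$$

Whence, taking into account that $\mu_0 \leq 1$, we obtain

$$\|x^{(p+q)} - x^{(p)}\| \leq \left(\frac{1}{2}\right)^p\mu_0^{2^p - 1}B_0\left[1 + \frac{1}{2} + \ldots + \left(\frac{1}{2}\right)^{q-1}\right] < \left(\frac{1}{2}\right)^{p-1}\mu_0^{2^p - 1}B_0$$

Passing to the limit as $q \to \infty$, we finally get

$$\|x^* - x^{(p)}\| \leq \left(\frac{1}{2}\right)^{p-1}\mu_0^{2^p - 1}B_0$$

where

$$\mu_0 = 2nA_0B_0C \leq 1$$

Thus, for $\mu_0 < 1$ the convergence of the Newton process is superfast. For $p = 0$ we have

$$\|x^* - x^{(0)}\| \leq 2B_0 \leq H$$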
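To see how quickly the bound (1) decreases, the recurrences of Sec. 13.3 and the closed-form estimate can be tabulated side by side. The following sketch is illustrative only; the function names and the sample values $\mu_0 = 0.5$, $B_0 = 1$ are assumptions chosen so that every quantity is exactly representable in binary floating point.

```python
def newton_error_bounds(mu0, b0, steps):
    """A priori bounds (1/2)^(p-1) * mu0^(2^p - 1) * b0 for p = 0, ..., steps-1."""
    return [0.5 ** (p - 1) * mu0 ** (2 ** p - 1) * b0 for p in range(steps)]


def recurrence(mu0, b0, steps):
    """Pairs (mu_p, B_p) from mu_p = mu_{p-1}^2 and B_p = mu_{p-1} B_{p-1} / 2."""
    mu, b, out = mu0, b0, []
    for _ in range(steps):
        out.append((mu, b))
        mu, b = mu * mu, 0.5 * mu * b   # right-hand side uses the old mu
    return out


bounds = newton_error_bounds(0.5, 1.0, 5)
pairs = recurrence(0.5, 1.0, 5)
for p, (bound, (mu_p, b_p)) in enumerate(zip(bounds, pairs)):
    print(p, mu_p, b_p, bound)
```

Comparing the columns shows that the bound in (1) is simply $2B_p$, in agreement with formula (2), and that the exponent of $\mu_0$ roughly doubles at every step.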


*13.5 UNIQUENESS OF SOLUTION


Theorem 3. Given the Conditions (1) to (4) of Theorem 1 (Sec. 13.3), there is, in the domain

$$\|x - x^{(0)}\| \leq 2B_0 \tag{1}$$

a unique solution of the system (1) (Sec. 13.3).

Proof. Suppose that besides the solution $x^*$ of system (1) of Sec. 13.3, which solution is defined by the Newton process, there is another solution $x^{**}$ of the system such that

$$\|x^{**} - x^{(0)}\| \leq 2B_0 \tag{2}$$

The successive approximations $x^{(p)}$ $(p = 0, 1, 2, \ldots)$ of the Newton process lie in the neighbourhood (1) and satisfy the condition

$$f\left(x^{(p)}\right) + W_p\left(x^{(p+1)} - x^{(p)}\right) = 0$$

where

$$W_p = W\left(x^{(p)}\right)$$

From this, taking into consideration that

$$f(x^{**}) = 0$$

we get

$$W_p\left(x^{(p+1)} - x^{**}\right) = f(x^{**}) - f\left(x^{(p)}\right) - W_p\left(x^{**} - x^{(p)}\right)$$

and, hence,

$$x^{(p+1)} - x^{**} = \Gamma_p\left[f(x^{**}) - f\left(x^{(p)}\right) - W_p\left(x^{**} - x^{(p)}\right)\right]$$

where

$$\Gamma_p = W_p^{-1}$$

Estimating by the norm, we get

$$\|x^{(p+1)} - x^{**}\| \leq \|\Gamma_p\| \cdot \left\|f(x^{**}) - f\left(x^{(p)}\right) - W_p\left(x^{**} - x^{(p)}\right)\right\|$$

According to the notations of Sec. 13.3 (see Theorem 1),

$$\|\Gamma_p\| \leq A_p$$

Applying Lemma 2 of Sec. 13.2, we get the inequality

$$\left\|f(x^{**}) - f\left(x^{(p)}\right) - W_p\left(x^{**} - x^{(p)}\right)\right\| \leq \frac{1}{2}\,nC\,\|x^{**} - x^{(p)}\|^2$$

where the constant $C$ is defined from Condition (3) of Theorem 1. For this reason

$$\|x^{(p+1)} - x^{**}\| \leq \frac{1}{2}\,nA_pC\,\|x^{**} - x^{(p)}\|^2 \qquad (p = 0, 1, 2, \ldots) \tag{3}$$

Putting $p = 0$ in inequality (3) and using inequality (2), we obtain

$$\|x^{(1)} - x^{**}\| \leq \frac{1}{2}\,nA_0C\,\|x^{**} - x^{(0)}\|^2 \leq 2nA_0B_0^2C$$

or, introducing the numbers defined by the relations

$$\mu_p = 2nA_pB_pC, \qquad B_{p+1} = \frac{1}{2}\mu_pB_p \qquad (p = 0, 1, 2, \ldots) \tag{4}$$

we find

$$\|x^{(1)} - x^{**}\| \leq \mu_0B_0 = 2B_1 \tag{5}$$

Similarly, for $p = 1$ we derive from formulas (3), (5) and (4)

$$\|x^{(2)} - x^{**}\| \leq \frac{1}{2}\,nA_1C\,\|x^{**} - x^{(1)}\|^2 \leq 2nA_1B_1^2C = \mu_1B_1 = 2B_2$$

Generally

$$\|x^{**} - x^{(p)}\| \leq 2B_p \qquad (p = 0, 1, 2, \ldots) \tag{6}$$

Since on the basis of formula (2) of Sec. 13.4 the quantity $B_p \to 0$ as $p \to \infty$, then by passing to the limit in inequality (6) we get

$$x^{**} = \lim_{p \to \infty} x^{(p)} = x^*$$

That is, the solution of system (1) in the domain $\|x - x^{(0)}\| \leq 2B_0$ is unique.

Note. If the domain $\bar U_H(x^{(0)})$ is such that

$$\frac{2}{\mu_0}B_0 \leq H$$

then in the extended domain

$$\|x - x^{(0)}\| \leq \frac{2}{\mu_0}B_0 \tag{7}$$

there are no solutions of system (1) other than $x^*$.

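Indeed, assuming that in the domain (7) there is a solution $x^{**}$ of system (1) (Sec. 13.3) and repeating the reasoning of the theorem, we obtain an inequality of the form (3):
$$\|x^{**} - x^{(p+1)}\| \le \tfrac{1}{2}\,nA_pC\,\|x^{**} - x^{(p)}\|^2$$
where $x^{(p)}$ ($p = 0, 1, 2, \ldots$) are successive approximations of the Newton process with initial approximation $x^{(0)}$. From this, since
$$\|x^{**} - x^{(0)}\| < \frac{2}{\mu_0}\,B_0$$
it follows that by using the numbers $\mu_{p+1} = \mu_p^2$ we get successively
$$\|x^{**} - x^{(1)}\| < \tfrac{1}{2}\,nA_0C \cdot \frac{4}{\mu_0^2}\,B_0^2 = 2nA_0B_0C \cdot \frac{B_0}{\mu_0^2} = \frac{B_0}{\mu_0} = \frac{2}{\mu_1}\,B_1,$$
$$\|x^{**} - x^{(2)}\| < \tfrac{1}{2}\,nA_1C \cdot \frac{4}{\mu_1^2}\,B_1^2 = 2nA_1B_1C \cdot \frac{B_1}{\mu_1^2} = \frac{B_1}{\mu_1} = \frac{2}{\mu_2}\,B_2$$
and so forth.

Generally
$$\|x^{**} - x^{(p)}\| < \frac{2}{\mu_p}\,B_p \qquad (p = 0, 1, 2, \ldots)$$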




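Since
$$B_p = \tfrac{1}{2}\,\mu_{p-1}B_{p-1}$$
and
$$\mu_p = \mu_{p-1}^2$$
then
$$\frac{B_p}{\mu_p} = \frac{1}{2}\,\frac{B_{p-1}}{\mu_{p-1}} = \ldots = \left(\frac{1}{2}\right)^p \frac{B_0}{\mu_0} \tag{8}$$
The latter relation can also be obtained directly from the formulas (1) and (2) of Sec. 13.4.

Thus
$$\|x^{**} - x^{(p)}\| < \left(\frac{1}{2}\right)^{p-1} \frac{B_0}{\mu_0} \qquad (p = 0, 1, 2, \ldots)$$
Hence
$$x^{**} = \lim_{p \to \infty} x^{(p)} = x^*$$
which is what we set out to prove.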


*13.6 STABILITY OF CONVERGENCE OF THE NEWTON PROCESS
UNDER VARIATIONS OF THE INITIAL APPROXIMATION


Theorem 4. If Conditions (1) to (4) of Theorem 1 (Sec. 13.3) are fulfilled and
$$\frac{2}{\mu_0}\,B_0 \le H$$
where $\mu_0 = 2nA_0B_0C < 1$, then the Newton process converges to a unique solution $x^*$ of the system (1) of Sec. 13.3 in the main domain $\|x - x^{(0)}\| \le 2B_0$ for any choice of the initial approximation $x'^{(0)}$ in the domain
$$\|x'^{(0)} - x^{(0)}\| \le \frac{1 - \mu_0}{2\mu_0}\,B_0 \tag{1}$$
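Proof. By analogy with the above-introduced notations
$$W_0 = W(x^{(0)}) \quad \text{and} \quad \Gamma_0 = W_0^{-1}$$
we introduce the designations
$$W_0' = W(x'^{(0)}) \quad \text{and} \quad \Gamma_0' = (W_0')^{-1}$$
We will show that conditions similar to the Conditions (1) to (4) of Theorem 1 will hold true at the point $x'^{(0)}$.
Using the notation and the method of proof of Theorem 1, we get
$$\|E - \Gamma_0W_0'\| = \|\Gamma_0\,(W_0 - W_0')\| \le \|\Gamma_0\|\,\|W_0 - W_0'\| \le A_0\,nC\,\|x'^{(0)} - x^{(0)}\|$$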




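whence, taking into account inequality (1), we obtain
$$\|E - \Gamma_0W_0'\| \le A_0\,nC \cdot \frac{1 - \mu_0}{2\mu_0}\,B_0 = \frac{1 - \mu_0}{4} < \frac{1}{4}$$
Hence
$$\|(\Gamma_0W_0')^{-1}\| = \|[E - (E - \Gamma_0W_0')]^{-1}\| \le \frac{1}{1 - \|E - \Gamma_0W_0'\|} \le \frac{1}{1 - \frac{1 - \mu_0}{4}} = \frac{4}{3 + \mu_0} \tag{2}$$
And so there exists
$$\Gamma_0' = (\Gamma_0W_0')^{-1}\,\Gamma_0$$
and
$$\|\Gamma_0'\| \le \|(\Gamma_0W_0')^{-1}\|\,\|\Gamma_0\| \le \frac{4A_0}{3 + \mu_0} = A_0' \tag{3}$$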


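We then derive
$$\|\Gamma_0 f(x'^{(0)})\| \le \|\Gamma_0\,[\,f(x'^{(0)}) - f(x^{(0)}) - W_0\,(x'^{(0)} - x^{(0)})\,]\| + \|\Gamma_0 f(x^{(0)})\| + \|x'^{(0)} - x^{(0)}\| \le$$
$$\le \tfrac{1}{2}\,A_0\,nC\,\|x'^{(0)} - x^{(0)}\|^2 + B_0 + \|x'^{(0)} - x^{(0)}\| \le$$
$$\le \frac{1}{4}\,\frac{\mu_0}{B_0}\,\frac{(1 - \mu_0)^2}{4\mu_0^2}\,B_0^2 + B_0 + \frac{1 - \mu_0}{2\mu_0}\,B_0 = \frac{1 - 2\mu_0 + \mu_0^2 + 16\mu_0 + 8 - 8\mu_0}{16\mu_0}\,B_0 = \frac{(3 + \mu_0)^2}{16\mu_0}\,B_0$$
From this, using inequality (2), we have
$$\|\Gamma_0' f(x'^{(0)})\| = \|(\Gamma_0W_0')^{-1}\,\Gamma_0 f(x'^{(0)})\| \le \|(\Gamma_0W_0')^{-1}\|\,\|\Gamma_0 f(x'^{(0)})\| \le$$
$$\le \frac{4}{3 + \mu_0} \cdot \frac{(3 + \mu_0)^2}{16\mu_0}\,B_0 = \frac{3 + \mu_0}{4\mu_0}\,B_0 = B_0' \tag{4}$$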


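On the basis of inequalities (3) and (4) we get
$$\mu_0' = 2nA_0'B_0'C = 2n \cdot \frac{4A_0}{3 + \mu_0} \cdot \frac{3 + \mu_0}{4\mu_0}\,B_0\,C = \frac{1}{\mu_0} \cdot 2nA_0B_0C = 1$$
Besides,
$$2B_0' + \|x'^{(0)} - x^{(0)}\| \le \frac{3 + \mu_0}{2\mu_0}\,B_0 + \frac{1 - \mu_0}{2\mu_0}\,B_0 = \frac{2B_0}{\mu_0} \le H$$
and hence all the more so
$$2B_0' < \frac{2B_0}{\mu_0} \le H$$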




Thus, the conditions of Theorem 1 are completely fulfilled at the point $x'^{(0)}$, and
$$U_{2B_0'}(x'^{(0)}) \subset U_{2B_0/\mu_0}(x^{(0)}) \subset U_H(x^{(0)}) \tag{5}$$

Fig. 58


For this reason, the Newton process
$$x'^{(p+1)} = x'^{(p)} - \Gamma_p'\,f(x'^{(p)})$$
where
$$\Gamma_p' = W^{-1}(x'^{(p)}) \qquad (p = 0, 1, 2, \ldots)$$
converges to a certain solution $x'^*$ of system (1), Sec. 13.3, lying in the domain $U_{2B_0'}(x'^{(0)})$. On the basis of formula (5)
$$x'^* \in U_{2B_0/\mu_0}(x^{(0)})$$
But by virtue of the note pertaining to Theorem 3 of the preceding section there is a unique solution $x^*$ of the basic system (1) in the domain $U_{2B_0/\mu_0}(x^{(0)})$, and so
$$x'^* = x^*$$
and
$$x^* = \lim_{p \to \infty} x'^{(p)}$$
which completes the proof.


Note. If $2B_0 < H$ and $\mu_0 < 1$, then for the initial approximation $x^{(0)}$ there is always a neighbourhood, any point of which can be taken for the initial approximation of a Newton process convergent to the desired solution $x^*$.
Indeed, suppose
$$2B_0 < 2qB_0 = H$$
where $q > 1$. Setting
$$\bar{\mu} = \max\left(\mu_0, \frac{1}{q}\right)$$
we find that by Theorems 1 and 4 the appropriate Newton process, for any initial approximation $x'^{(0)}$ satisfying the condition
$$\|x'^{(0)} - x^{(0)}\| \le \frac{1 - \bar{\mu}}{2\bar{\mu}}\,B_0$$
will converge to the solution $x^*$ of system (1).


13.7 THE MODIFIED NEWTON METHOD


An essential inconvenience in forming the Newton process
$$x^{(p+1)} = x^{(p)} - W^{-1}(x^{(p)})\,f(x^{(p)}) \qquad (p = 0, 1, 2, \ldots) \tag{1}$$
is the necessity to compute the inverse matrix $W^{-1}(x^{(p)})$ for each step. If the matrix $W^{-1}(x)$ is continuous in the neighbourhood of the desired solution $x^*$ and the initial approximation $x^{(0)}$ is sufficiently close to $x^*$, then we can approximately put
$$W^{-1}(x^{(p)}) \approx W^{-1}(x^{(0)})$$
We thus arrive at the modified Newton process
$$\xi^{(p+1)} = \xi^{(p)} - W^{-1}(x^{(0)})\,f(\xi^{(p)}) \tag{2}$$
($p = 0, 1, 2, \ldots$), where $\xi^{(0)} = x^{(0)}$. Note that the first approximations $x^{(1)}$ and $\xi^{(1)}$ coincide for the processes (1) and (2):
$$x^{(1)} = \xi^{(1)}$$

The convergence of the modified Newton process (2) has been 

investigated by L. V. Kantorovich [1]. 
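
The modified process (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `solve_linear` and `modified_newton`, the stopping tolerance, and the test system $x^2 + y^2 = 4$, $xy = 1$ are hypothetical and do not come from the text. Instead of forming $W^{-1}(x^{(0)})$ explicitly, the sketch solves the linear system $W(x^{(0)})\,d = f(\xi^{(p)})$ at each step, which gives the same correction.

```python
def solve_linear(a, b):
    """Solve a*y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            t = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= t * m[c][k]
    y = [0.0] * n
    for r in range(n - 1, -1, -1):                        # back substitution
        s = sum(m[r][k] * y[k] for k in range(r + 1, n))
        y[r] = (m[r][n] - s) / m[r][r]
    return y


def modified_newton(f, jacobian, x0, steps=50, tol=1e-12):
    """Process (2): the matrix W(x^(0)) is formed once and reused at every step."""
    w0 = jacobian(x0)
    xi = list(x0)
    for _ in range(steps):
        d = solve_linear(w0, f(xi))          # d = W^{-1}(x^(0)) f(xi^(p))
        xi = [u - v for u, v in zip(xi, d)]
        if max(abs(v) for v in d) < tol:     # m-norm of the correction
            break
    return xi


# Hypothetical test system: x^2 + y^2 = 4, x*y = 1, started near a root.
def f(v):
    return [v[0] ** 2 + v[1] ** 2 - 4.0, v[0] * v[1] - 1.0]


def jacobian(v):
    return [[2.0 * v[0], 2.0 * v[1]], [v[1], v[0]]]


root = modified_newton(f, jacobian, [2.0, 0.5])
```

The trade-off is the one described above: each step is cheaper than a full Newton step, but the convergence is only linear (with ratio of the order of $\mu_0$) rather than quadratic.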


Theorem. If the Conditions (1) to (4) of Theorem 1 (Sec. 13.3) are fulfilled and
$$\mu_0 = 2nA_0B_0C < 1$$
then the modified Newton process (2) defined by the initial approximation $\xi^{(0)} = x^{(0)}$ converges to the solution $x^*$ of the system
$$f(x) = 0$$
and
$$\|x^* - \xi^{(p)}\| \le \mu_0^p\,\|x^* - \xi^{(0)}\| \le 2B_0\,\mu_0^p \qquad (p = 0, 1, 2, \ldots) \tag{3}$$
where the norm is understood to be the m-norm.




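Proof. Consider the vector function
$$F(x) = x - \Gamma_0 f(x) = [F_i(x)]$$
where $\Gamma_0 = W^{-1}(x^{(0)})$.
Obviously
$$F(\xi^{(p)}) = \xi^{(p)} - \Gamma_0 f(\xi^{(p)}) = \xi^{(p+1)} \qquad (p = 0, 1, 2, \ldots) \tag{4}$$
Moreover
$$F'(x) = E - \Gamma_0 f'(x) \tag{5}$$
whence, in particular,
$$F'(x^{(0)}) = E - \Gamma_0 f'(x^{(0)}) = 0 \tag{6}$$
We will prove by induction that all the approximations $\xi^{(p)}$ ($p = 1, 2, \ldots$) lie in the $2B_0$-neighbourhood of the point $x^{(0)}$, that is
$$\|\xi^{(p)} - x^{(0)}\| < 2B_0 \tag{7}$$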


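Indeed, for $p = 1$, (7) is obvious since by Condition (2) of the theorem, we have
$$\|\xi^{(1)} - x^{(0)}\| = \|x^{(1)} - x^{(0)}\| \le B_0$$
Now suppose that for some $p$ the inequality (7) is valid. Then, using Lemma 2 (Sec. 13.2), we have
$$\|\xi^{(p+1)} - x^{(0)}\| = \|F(\xi^{(p)}) - x^{(0)}\| = \|\xi^{(p)} - \Gamma_0 f(\xi^{(p)}) - x^{(0)}\| =$$
$$= \|\Gamma_0\,[\,f(\xi^{(p)}) - f(x^{(0)}) - W(x^{(0)})\,(\xi^{(p)} - x^{(0)})\,] + \Gamma_0 f(x^{(0)})\| \le$$
$$\le \|\Gamma_0 f(x^{(0)})\| + \|\Gamma_0\|\,\|f(\xi^{(p)}) - f(x^{(0)}) - W(x^{(0)})\,(\xi^{(p)} - x^{(0)})\| \le B_0 + \tfrac{1}{2}\,nA_0C\,\|\xi^{(p)} - x^{(0)}\|^2$$
Using (7), we find
$$\|\xi^{(p+1)} - x^{(0)}\| < B_0 + \tfrac{1}{2}\,nA_0C \cdot 4B_0^2 = B_0 + 2nA_0B_0C \cdot B_0 = (1 + \mu_0)\,B_0 < 2B_0$$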


which proves our assertion.

Since the conditions of Theorem 1 of Sec. 13.3 are assumed to be fulfilled, the system $f(x) = 0$ has a root $x^*$ such that
$$\|x^* - x^{(0)}\| \le 2B_0$$
Let us consider the difference $x^* - \xi^{(p)}$, where $p \ge 1$. Taking into account that
$$F(x^*) = x^* - \Gamma_0 f(x^*) = x^*$$
and utilizing Lemma 1 of Sec. 13.2, we have
$$\|x^* - \xi^{(p)}\| = \|F(x^*) - F(\xi^{(p-1)})\| \le \|x^* - \xi^{(p-1)}\|\,\|F'(\theta)\| \tag{8}$$
where $\theta$ is a point in the interval $[x^*, \xi^{(p-1)}]$.




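Furthermore (see Sec. 13.2, Lemma 1, Corollary 2)
$$\|F'(\theta)\| = \|F'(\theta) - F'(x^{(0)})\| \le n\,\|\theta - x^{(0)}\|\,\max \|F''(\eta)\| \tag{9}$$
where $\eta$ is a point in the interval $[\theta, x^{(0)}]$. From formula (5) we have
$$F'(x) = \left[\delta_{ij} - \sum_{s=1}^{n} \gamma_{is}\,\frac{\partial f_s}{\partial x_j}\right]$$
where $\delta_{ij}$ is the Kronecker delta and $\Gamma_0 = [\gamma_{ij}]$. Therefore
$$\frac{\partial F_i}{\partial x_j} = \delta_{ij} - \sum_{s=1}^{n} \gamma_{is}\,\frac{\partial f_s}{\partial x_j}$$
and
$$\frac{\partial^2 F_i}{\partial x_j\,\partial x_k} = -\sum_{s=1}^{n} \gamma_{is}\,\frac{\partial^2 f_s}{\partial x_j\,\partial x_k}$$
Consequently
$$\|F''(\eta)\| = \max_{i,j} \sum_{k=1}^{n} \left|\frac{\partial^2 F_i(\eta)}{\partial x_j\,\partial x_k}\right| = \max_{i,j} \sum_{k=1}^{n} \left|\sum_{s=1}^{n} \gamma_{is}\,\frac{\partial^2 f_s(\eta)}{\partial x_j\,\partial x_k}\right| \le \max_i \sum_{s=1}^{n} |\gamma_{is}|\,C = C\,\|\Gamma_0\| \le A_0C$$
and, hence, on the basis of (9),
$$\|F'(\theta)\| \le nA_0C\,\|\theta - x^{(0)}\|$$
Since the point $\theta$ plainly belongs to the $2B_0$-neighbourhood of the point $x^{(0)}$, it follows that
$$\|\theta - x^{(0)}\| \le 2B_0$$
and, hence,
$$\|F'(\theta)\| \le 2nA_0B_0C = \mu_0 \tag{10}$$
Taking into consideration inequality (10), we derive from inequality (8) that
$$\|x^* - \xi^{(p)}\| \le \mu_0\,\|x^* - \xi^{(p-1)}\|$$
whence
$$\|x^* - \xi^{(p)}\| \le \mu_0^p\,\|x^* - \xi^{(0)}\| = \mu_0^p\,\|x^* - x^{(0)}\| \le 2B_0\,\mu_0^p$$
From the last inequality there follows, for $\mu_0 < 1$,
$$\lim_{p \to \infty} \xi^{(p)} = x^*$$
The proof of the theorem is complete.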




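13.8 THE METHOD OF ITERATION
Suppose we are given a system of nonlinear equations of a special kind
$$\begin{aligned} x_1 &= \varphi_1(x_1, x_2, \ldots, x_n), \\ x_2 &= \varphi_2(x_1, x_2, \ldots, x_n), \\ &\;\;\vdots \\ x_n &= \varphi_n(x_1, x_2, \ldots, x_n) \end{aligned} \tag{1}$$
where the functions $\varphi_1, \varphi_2, \ldots, \varphi_n$ are real, defined and continuous in some neighbourhood $\omega$ of an isolated solution $(x_1^*, x_2^*, \ldots, x_n^*)$ of this system.
Introducing the vectors
$$x = (x_1, x_2, \ldots, x_n) \quad \text{and} \quad \varphi(x) = (\varphi_1(x), \varphi_2(x), \ldots, \varphi_n(x))$$
we can write system (1) more compactly:
$$x = \varphi(x) \tag{2}$$
In finding the vector root $x^* = (x_1^*, x_2^*, \ldots, x_n^*)$ of equation (2), it is often convenient to use the method of iteration
$$x^{(p+1)} = \varphi(x^{(p)}) \qquad (p = 0, 1, 2, \ldots) \tag{3}$$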


where the initial approximation $x^{(0)} \approx x^*$. The convergence of this process will be investigated below. Note that if the process of iteration (3) converges, then the limiting value
$$\xi = \lim_{p \to \infty} x^{(p)} \tag{4}$$
is definitely a root of equation (2). Indeed, assuming that relation (4) is valid, and passing to the limit in (3) as $p \to \infty$, we will have, by virtue of the continuity of the function $\varphi(x)$,
$$\lim_{p \to \infty} x^{(p+1)} = \varphi\left(\lim_{p \to \infty} x^{(p)}\right)$$
that is,
$$\xi = \varphi(\xi)$$
Thus, $\xi$ is a root of the vector equation (2).

If, besides, all the approximations $x^{(p)}$ ($p = 0, 1, 2, \ldots$) belong to the domain $\omega$ and $x^*$ is a unique root of system (2) in $\omega$, then, clearly,
$$\xi = x^*$$
The method of iteration can also be applied to the general 
system 
f(x)=0 (5) 


where $f(x)$ is a vector function defined and continuous in the neighbourhood $\omega$ of an isolated vector root $x^*$. For example, rewrite this system as
$$x = x + \Lambda f(x)$$
where $\Lambda$ is a nonsingular matrix. Introducing the notation
$$x + \Lambda f(x) = \varphi(x) \tag{6}$$
we will have
$$x = \varphi(x) \tag{7}$$
The ordinary method of iteration (3) is readily applicable to this equation.
If the function $f(x)$ has a continuous derivative $f'(x)$ in $\omega$, then from formula (6) follows
$$\varphi'(x) = E + \Lambda f'(x)$$
In the sections which follow, proof will be given that the process of iteration rapidly converges for equation (7) if $\varphi'(x)$ is small in norm. Taking this circumstance into account, we choose matrix $\Lambda$ so that
$$\varphi'(x^{(0)}) = E + \Lambda f'(x^{(0)}) = 0$$
whence, if the matrix $f'(x^{(0)})$ is nonsingular, we will have
$$\Lambda = -[f'(x^{(0)})]^{-1}$$
Note that essentially this is the modified Newton process applied to equation (5) (see Sec. 13.7).

If $\det f'(x^{(0)}) = 0$, then one should choose a different initial approximation $x^{(0)}$.

There are also other ways of replacing system (5) by an equivalent system (7).

Example. Use the method of iteration to give an approximate solution of the system
$$\left.\begin{aligned} x_1^2 + x_2^2 &= 1, \\ x_1^3 - x_2 &= 0 \end{aligned}\right\} \tag{8}$$
Solution. From a graphical construction (Fig. 59), it can be seen that the system (8) has two solutions that differ only in sign. We limit ourselves to the positive solution. From the figure we see that it is possible to take
$$x^{(0)} = \begin{bmatrix} 0.9 \\ 0.5 \end{bmatrix}$$

Fig. 59


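for the initial approximation of the positive solution of system (8).

Setting
$$f(x) = \begin{bmatrix} x_1^2 + x_2^2 - 1 \\ x_1^3 - x_2 \end{bmatrix}$$
we have
$$f'(x) = \begin{bmatrix} 2x_1 & 2x_2 \\ 3x_1^2 & -1 \end{bmatrix}$$
whence
$$f'(x^{(0)}) = \begin{bmatrix} 1.8 & 1 \\ 2.43 & -1 \end{bmatrix}$$
and
$$\det f'(x^{(0)}) = -1.8 - 2.43 = -4.23$$
Since the matrix $f'(x^{(0)})$ is nonsingular, there is an inverse matrix
$$[f'(x^{(0)})]^{-1} = -\frac{1}{4.23}\begin{bmatrix} -1 & -1 \\ -2.43 & 1.8 \end{bmatrix}$$
Thus
$$\Lambda = -[f'(x^{(0)})]^{-1} = \frac{1}{4.23}\begin{bmatrix} -1 & -1 \\ -2.43 & 1.8 \end{bmatrix}$$
Put
$$\varphi(x) = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \frac{1}{4.23}\begin{bmatrix} -1 & -1 \\ -2.43 & 1.8 \end{bmatrix}\begin{bmatrix} x_1^2 + x_2^2 - 1 \\ x_1^3 - x_2 \end{bmatrix}$$
Then the system (8) will be equivalent to the standard matrix equation
$$x = \varphi(x) \tag{9}$$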


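Using formula (3), we find the following successive approximations for the solution of system (9):
$$x^{(1)} = \begin{bmatrix} 0.9 \\ 0.5 \end{bmatrix} - \frac{1}{4.23}\begin{bmatrix} 1 & 1 \\ 2.43 & -1.8 \end{bmatrix}\begin{bmatrix} 0.81 + 0.25 - 1 \\ 0.729 - 0.5 \end{bmatrix} = \begin{bmatrix} 0.9 \\ 0.5 \end{bmatrix} - \frac{1}{4.23}\begin{bmatrix} 1 & 1 \\ 2.43 & -1.8 \end{bmatrix}\begin{bmatrix} 0.060 \\ 0.229 \end{bmatrix} = \begin{bmatrix} 0.9 \\ 0.5 \end{bmatrix} - \begin{bmatrix} 0.0683 \\ -0.0630 \end{bmatrix} = \begin{bmatrix} 0.8317 \\ 0.5630 \end{bmatrix},$$
$$x^{(2)} = \begin{bmatrix} 0.8317 \\ 0.5630 \end{bmatrix} - \frac{1}{4.23}\begin{bmatrix} 1 & 1 \\ 2.43 & -1.8 \end{bmatrix}\begin{bmatrix} 0.8317^2 + 0.5630^2 - 1 \\ 0.8317^3 - 0.5630 \end{bmatrix} = \begin{bmatrix} 0.8317 \\ 0.5630 \end{bmatrix} - \begin{bmatrix} 0.0049 \\ -0.0003 \end{bmatrix} = \begin{bmatrix} 0.8268 \\ 0.5633 \end{bmatrix},$$
$$x^{(3)} = \begin{bmatrix} 0.8268 \\ 0.5633 \end{bmatrix} - \begin{bmatrix} 0.0007 \\ 0.0002 \end{bmatrix} = \begin{bmatrix} 0.8261 \\ 0.5631 \end{bmatrix},$$
$$x^{(4)} = \begin{bmatrix} 0.8261 \\ 0.5631 \end{bmatrix} - \begin{bmatrix} 0.0000 \\ -0.0005 \end{bmatrix} = \begin{bmatrix} 0.8261 \\ 0.5636 \end{bmatrix}$$
and so on.
Stopping with the fourth approximation, we have the roots
$$x_1 = 0.8261, \qquad x_2 = 0.5636$$
and
$$f(x^{(4)}) = \begin{bmatrix} 0.0001 \\ 0.0002 \end{bmatrix}$$
to within an accuracy of $10^{-3}$.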
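
The arithmetic of this example can be retraced with a short Python sketch. The function names `f`, `phi` and `iterate` are illustrative choices, not notation from the text; the matrix `LAM` holds $\Lambda = -[f'(x^{(0)})]^{-1}$ with the entries computed above. Since the sketch carries full floating-point precision, its iterates agree with the hand computation only up to the rounding used there.

```python
def f(x):
    """Left-hand sides of system (8) written as f(x) = 0."""
    x1, x2 = x
    return (x1 ** 2 + x2 ** 2 - 1.0, x1 ** 3 - x2)


# Lambda = -[f'(x0)]^{-1} for x0 = (0.9, 0.5), entries taken from the example.
LAM = ((-1.0 / 4.23, -1.0 / 4.23),
       (-2.43 / 4.23, 1.8 / 4.23))


def phi(x):
    """phi(x) = x + Lambda f(x), the right-hand side of equation (9)."""
    f1, f2 = f(x)
    return (x[0] + LAM[0][0] * f1 + LAM[0][1] * f2,
            x[1] + LAM[1][0] * f1 + LAM[1][1] * f2)


def iterate(x0, steps):
    """Return the list x^(0), x^(1), ..., x^(steps) of process (3)."""
    xs = [x0]
    for _ in range(steps):
        xs.append(phi(xs[-1]))
    return xs


approximations = iterate((0.9, 0.5), 4)
for p, x in enumerate(approximations):
    print(p, round(x[0], 4), round(x[1], 4))
```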


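*13.9 THE NOTION OF A CONTRACTION MAPPING
Let there be given a nonlinear system
$$\begin{aligned} y_1 &= \varphi_1(x_1, x_2, \ldots, x_n), \\ y_2 &= \varphi_2(x_1, x_2, \ldots, x_n), \\ &\;\;\vdots \\ y_n &= \varphi_n(x_1, x_2, \ldots, x_n) \end{aligned} \tag{1}$$
where the functions $\varphi_1, \varphi_2, \ldots, \varphi_n$ are defined and continuous in a known domain $G$ of a real n-dimensional space $E_n$ (Sec. 10.1), their values $(y_1, y_2, \ldots, y_n)$ for $(x_1, x_2, \ldots, x_n) \in G$ filling some domain $G' \subset E_n$. System (1) establishes a mapping of domain $G$ onto $G'$. Introducing the vectors
$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix}, \qquad \varphi = \begin{bmatrix} \varphi_1 \\ \vdots \\ \varphi_n \end{bmatrix}$$
we can write (1) compactly as
$$y = \varphi(x) \tag{1'}$$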


We introduce in space $E_n$ the canonical norm $\|x\|$ which satisfies the ordinary conditions. For example, one can put
$$\|x\|_m = \max_i |x_i|$$
or
$$\|x\|_l = \sum_{i=1}^{n} |x_i|$$
or
$$\|x\|_k = \sqrt{\sum_{i=1}^{n} |x_i|^2}$$
and so on.


The mapping (1) or (1') is termed a contraction mapping in the domain $G$ if there exists a proper fraction $q$ such that for any two points $x_1, x_2 \in G$ their images $y_1 = \varphi(x_1)$ and $y_2 = \varphi(x_2)$ satisfy the condition
$$\|y_1 - y_2\| \le q\,\|x_1 - x_2\| \tag{2}$$
That is,
$$\|\varphi(x_1) - \varphi(x_2)\| \le q\,\|x_1 - x_2\| \qquad (0 \le q < 1) \tag{2'}$$
Let us consider the nonlinear vector equation
$$x = \varphi(x) \tag{3}$$
which is equivalent to the special kind of nonlinear system of equations
$$\begin{aligned} x_1 &= \varphi_1(x_1, x_2, \ldots, x_n), \\ x_2 &= \varphi_2(x_1, x_2, \ldots, x_n), \\ &\;\;\vdots \\ x_n &= \varphi_n(x_1, x_2, \ldots, x_n) \end{aligned} \tag{3'}$$
The solution $x^*$ of this equation, if it exists, is a fixed point of the transformation (1). To find $x^*$, construct the iteration process
$$x^{(p)} = \varphi(x^{(p-1)}) \qquad (p = 1, 2, \ldots) \tag{4}$$
where $x^{(0)} \in G$.

Theorem 1. Let domain $G$ be closed and let the mapping (1) be a contraction mapping in $G$; that is, condition (2) is fulfilled. Then, if for the iteration process (4) all successive approximations $x^{(p)} \in G$ ($p = 0, 1, 2, \ldots$), it follows that (1) irrespective of the choice of the initial approximation $x^{(0)}$ the process (4) converges, i.e., there exists the limit
$$x^* = \lim_{p \to \infty} x^{(p)} \tag{5}$$
(2) the limiting vector $x^*$ is the sole solution of equation (3) in the domain $G$; (3) the estimate
$$\|x^* - x^{(p)}\| \le \frac{q^p}{1 - q}\,\|x^{(1)} - x^{(0)}\| \tag{6}$$
holds true.




Proof. (1) To prove the convergence of the sequence of approximations $x^{(p)}$ ($p = 0, 1, 2, \ldots$), we apply the Cauchy test (see Sec. 7.9). We have
$$\|x^{(p+k)} - x^{(p)}\| = \|(x^{(p+1)} - x^{(p)}) + (x^{(p+2)} - x^{(p+1)}) + \ldots + (x^{(p+k)} - x^{(p+k-1)})\| \le$$
$$\le \|x^{(p+1)} - x^{(p)}\| + \|x^{(p+2)} - x^{(p+1)}\| + \ldots + \|x^{(p+k)} - x^{(p+k-1)}\| \tag{7}$$
Utilizing relation (4) and the "contraction condition" (2), we successively obtain
$$\|x^{(s+1)} - x^{(s)}\| = \|\varphi(x^{(s)}) - \varphi(x^{(s-1)})\| \le q\,\|x^{(s)} - x^{(s-1)}\| \le q^2\,\|x^{(s-1)} - x^{(s-2)}\| \le \ldots \le q^s\,\|x^{(1)} - x^{(0)}\| \tag{8}$$
where $s \ge 0$. Therefore, strengthening the right member of inequality (7), we get
$$\|x^{(p+k)} - x^{(p)}\| \le q^p\,\|x^{(1)} - x^{(0)}\| + q^{p+1}\,\|x^{(1)} - x^{(0)}\| + \ldots + q^{p+k-1}\,\|x^{(1)} - x^{(0)}\|$$
or, using the formula for the sum of terms of a geometric progression, we find
$$\|x^{(p+k)} - x^{(p)}\| \le \frac{q^p - q^{p+k}}{1 - q}\,\|x^{(1)} - x^{(0)}\| \le \frac{q^p}{1 - q}\,\|x^{(1)} - x^{(0)}\| \tag{9}$$
Since $0 \le q < 1$ and, hence, $q^p \to 0$ as $p \to \infty$, from formula (9) it follows that for any $\varepsilon > 0$ there is an $N = N(\varepsilon)$ such that for $p > N(\varepsilon)$ and $k > 0$ the inequality
$$\|x^{(p+k)} - x^{(p)}\| < \varepsilon$$
holds true, that is, for the sequence $x^{(p)}$ ($p = 0, 1, 2, \ldots$) Cauchy's test is valid. Therefore, there exists
$$x^* = \lim_{p \to \infty} x^{(p)}$$
and $x^* \in G$ since the domain $G$ is closed.

(2) The vector $x^*$ is a solution of equation (3) since, by passing to the limit in (4) as $p \to \infty$ and taking into account the continuity in $G$ of the vector function $\varphi(x)$, we have
$$\lim_{p \to \infty} x^{(p)} = \varphi\left(\lim_{p \to \infty} x^{(p-1)}\right)$$
that is
$$x^* = \varphi(x^*) \tag{10}$$
This solution is unique in $G$. Indeed, let $x^{**}$ be another solution of (3):
$$x^{**} = \varphi(x^{**}) \tag{11}$$
Subtracting (11) from (10), we get
$$x^* - x^{**} = \varphi(x^*) - \varphi(x^{**})$$
whence
$$\|x^* - x^{**}\| = \|\varphi(x^*) - \varphi(x^{**})\| \le q\,\|x^* - x^{**}\|$$
or
$$(1 - q)\,\|x^* - x^{**}\| \le 0 \tag{12}$$
Since $1 - q > 0$, then inequality (12) can only hold if $\|x^* - x^{**}\| = 0$, that is, when $x^* = x^{**}$. Thus, there cannot be any other solution of equation (3) in the domain $G$.
(3) Passing to the limit in inequality (9) as k—+oo we get the 
estimate (6). 
This completes the proof of Theorem 1. 
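Note 1. If $G$ coincides with the whole space $E_n$, then the condition $x^{(p)} \in G$ ($p = 0, 1, 2, \ldots$) is obviously superfluous.
Note 2. Using the inequalities
$$\|x^{(p+1)} - x^{(p)}\| \le q\,\|x^{(p)} - x^{(p-1)}\|,$$
$$\|x^{(p+2)} - x^{(p+1)}\| \le q^2\,\|x^{(p)} - x^{(p-1)}\|, \quad \ldots$$
we obtain from formula (7)
$$\|x^{(p+k)} - x^{(p)}\| \le q\,\|x^{(p)} - x^{(p-1)}\| + q^2\,\|x^{(p)} - x^{(p-1)}\| + \ldots + q^k\,\|x^{(p)} - x^{(p-1)}\| \le \frac{q}{1 - q}\,\|x^{(p)} - x^{(p-1)}\|$$
whence, as $k \to \infty$, we get
$$\|x^* - x^{(p)}\| \le \frac{q}{1 - q}\,\|x^{(p)} - x^{(p-1)}\| \tag{13}$$
In particular, if $0 \le q \le \tfrac{1}{2}$, then formula (13) implies that for
$$\|x^{(p)} - x^{(p-1)}\| \le \varepsilon$$
the inequality
$$\|x^* - x^{(p)}\| \le \varepsilon$$
holds true.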
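
The a priori estimate (6) and the a posteriori estimate (13) can be compared numerically. The sketch below is an illustration under assumptions: the mapping $\varphi(x) = (0.5\cos x_2,\ 0.5\sin x_1)$ is a hypothetical example, chosen because it is a contraction in the m-norm on the whole plane with $q = 0.5$; the names `phi`, `dist_m`, `a_priori` and `a_posteriori` are likewise illustrative. The fixed point is not known in closed form, so a late iterate serves as a reference value.

```python
import math


def phi(x):
    """Hypothetical contraction: each component changes by at most 0.5*|dx|."""
    return (0.5 * math.cos(x[1]), 0.5 * math.sin(x[0]))


def dist_m(u, v):
    """m-norm of the difference u - v."""
    return max(abs(a - b) for a, b in zip(u, v))


q = 0.5
xs = [(0.0, 0.0)]
for _ in range(60):
    xs.append(phi(xs[-1]))
x_star = xs[-1]          # reference value; by (6) its error is below 1e-17


def a_priori(p):
    """Right-hand side of estimate (6)."""
    return q ** p / (1.0 - q) * dist_m(xs[1], xs[0])


def a_posteriori(p):
    """Right-hand side of estimate (13)."""
    return q / (1.0 - q) * dist_m(xs[p], xs[p - 1])


for p in range(1, 6):
    print(p, dist_m(x_star, xs[p]), a_posteriori(p), a_priori(p))
```

In practice (13) is the convenient stopping rule, since it uses only the two most recent approximations.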

Under the hypothesis of Theorem 1 it is required that all the 
approximations xP belong to the fixed domain G. This is someti- 
mes hard to verify. For this reason, we will give a slightly modi- 
fied statement of Theorem 1. 

Theorem 2. Let (1) be a contraction mapping in a closed domain $G$ and let $g$ be a bounded domain lying in $G$ together with its $\rho$-neighbourhood (in the sense of the norm that has been introduced), where
$$\rho \ge \frac{Dq}{1 - q} \tag{14}$$
Furthermore, let $D$ be the diameter of $g$ and let $q$ be a corresponding coefficient in the inequality (2). Then, if
$$x^{(0)} \in g \quad \text{and} \quad x^{(1)} = \varphi(x^{(0)}) \in g$$
the iteration process (4) converges and the conclusions of Theorem 1 hold true.


Proof. It obviously suffices to prove that all the successive approximations lie in the domain $G$:
$$x^{(p)} \in G \qquad (p = 2, 3, \ldots) \tag{15}$$
We carry out the proof by induction.
First of all, by hypothesis, $x^{(0)} \in G$ and $x^{(1)} \in G$. Now let $x^{(s)} \in G$ ($s = 0, 1, 2, \ldots, p$). Since
$$\|x^{(1)} - x^{(0)}\| \le D$$
we have, on the basis of inequality (8),
$$\|x^{(s+1)} - x^{(s)}\| \le q^s\,\|x^{(1)} - x^{(0)}\| \le Dq^s \qquad (s = 0, 1, 2, \ldots, p)$$
Therefore
$$\|x^{(p+1)} - x^{(1)}\| \le \|x^{(2)} - x^{(1)}\| + \ldots + \|x^{(p+1)} - x^{(p)}\| \le Dq + \ldots + Dq^p \le \frac{Dq}{1 - q} \le \rho$$
Consequently, $x^{(p+1)} \in G$.
Thus, Property (15) is established and we are in the conditions of Theorem 1. The proof of Theorem 2 is complete.

Note. The conditions of Theorem 2 are equivalent to the following: let $x^{(0)} \in G$ and $x^{(1)} \in G$, where $\|x^{(1)} - x^{(0)}\| \le D$. If the distance of point $x^{(1)}$ from the boundary $\Gamma$ of $G$ is not less than $\rho$, then the iteration process (4) converges.


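*13.10 FIRST SUFFICIENT CONDITION FOR THE CONVERGENCE
OF THE PROCESS OF ITERATION

Let us consider a real reduced system
$$\begin{aligned} x_1 &= \varphi_1(x_1, x_2, \ldots, x_n), \\ x_2 &= \varphi_2(x_1, x_2, \ldots, x_n), \\ &\;\;\vdots \\ x_n &= \varphi_n(x_1, x_2, \ldots, x_n) \end{aligned} \tag{1}$$
or, in matrix form,
$$x = \varphi(x) \tag{1'}$$
where
$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \qquad \varphi(x) = \begin{bmatrix} \varphi_1(x) \\ \vdots \\ \varphi_n(x) \end{bmatrix}$$
It is assumed that the vector function $\varphi(x)$ is defined and continuous together with its derivative $\varphi'(x) = \left[\dfrac{\partial \varphi_i}{\partial x_j}\right]$ in a convex, bounded and closed domain $G \subset E_n$.
In this section we make use of two norms:
$$\|x\|_m = \max_i |x_i|$$
and
$$\|x\|_l = \sum_{i=1}^{n} |x_i|$$
With respect to domain $G$ we introduce the norms
$$\|\varphi'(x)\|_{\mathrm{I}} = \max_{x \in G} \|\varphi'(x)\|_m \tag{2}$$
and
$$\|\varphi'(x)\|_{\mathrm{II}} = \max_{x \in G} \|\varphi'(x)\|_l \tag{3}$$
where
$$\|\varphi'(x)\|_m = \max_i \sum_{j=1}^{n} \left|\frac{\partial \varphi_i(x)}{\partial x_j}\right| \tag{2'}$$
and
$$\|\varphi'(x)\|_l = \max_j \sum_{i=1}^{n} \left|\frac{\partial \varphi_i(x)}{\partial x_j}\right| \tag{3'}$$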

Ox; 
Theorem. Let the functions $\varphi(x)$ and $\varphi'(x)$ be continuous in the domain $G$, and, in $G$, let the inequality
$$\|\varphi'(x)\|_{\mathrm{I}} \le q < 1, \tag{4}$$
where $q$ is a constant, hold true.
If the successive approximations
$$x^{(p+1)} = \varphi(x^{(p)}) \tag{5}$$
($p = 0, 1, 2, \ldots$) lie in $G$, then the iteration process (5) converges and the limiting vector
$$x^* = \lim_{p \to \infty} x^{(p)}$$
is the sole solution of system (1) in domain $G$.




Proof. By virtue of Theorem 1 of the preceding section, it suffices to demonstrate that the mapping
$$y = \varphi(x) \tag{6}$$
is, given Condition (4), a contraction mapping in $G$ in the sense of the m-norm.

Suppose $x_1, x_2 \in G$ and $y_i = \varphi(x_i)$ ($i = 1, 2$).
By the Corollary 1 of Lemma 1 (Sec. 13.2), we have
$$\|y_1 - y_2\|_m = \|\varphi(x_1) - \varphi(x_2)\|_m \le \|x_1 - x_2\|_m\,\|\varphi'(\xi)\|_m \le \|x_1 - x_2\|_m\,\|\varphi'(x)\|_{\mathrm{I}}$$
whence
$$\|y_1 - y_2\|_m \le q\,\|x_1 - x_2\|_m$$
where $0 \le q < 1$, which completes the proof.
Corollary. The iteration process (5) converges if
$$\sum_{j=1}^{n} \left|\frac{\partial \varphi_i(x)}{\partial x_j}\right| \le q_i < 1 \qquad (i = 1, 2, \ldots, n) \tag{7}$$
when $x \in G$.
Obviously, from the system of inequalities (7) follows Condition (4) of the theorem.

Note. On the basis of Theorem 1, Sec. 13.9, we obtain the following estimate for the approximation $x^{(p)}$:
$$\|x^* - x^{(p)}\|_m \le \frac{q^p}{1 - q}\,\|x^{(1)} - x^{(0)}\|_m \qquad (p = 0, 1, 2, \ldots)$$
where $x^{(1)} = \varphi(x^{(0)})$.


*13.11 SECOND SUFFICIENT CONDITION FOR
THE CONVERGENCE OF THE PROCESS OF ITERATION
Before taking up the proof of the convergence theorem that uses the norm $\|\varphi'(x)\|_{\mathrm{II}}$, let us derive an estimate for the difference of the values of the vector function. This estimate is similar to the mean-value theorem and is also of interest in itself.

Lemma. If the vector function
$$f(x)$$
is continuous together with its derivative $f'(x)$ in a convex domain containing the points $x$ and $x + \Delta x$, then
$$\|f(x + \Delta x) - f(x)\|_l \le \|\Delta x\|_l\,\|f'(\xi)\|_l \tag{1}$$
where $\xi = x + \theta\,\Delta x$ and $0 < \theta < 1$.
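Proof. Consider the auxiliary function
$$\Phi(t) = \sum_{i=1}^{n} \varepsilon_i\,[\,f_i(x + t\,\Delta x) - f_i(x)\,]$$
where $0 \le t \le 1$ is a scalar argument and $\varepsilon_i$ is a sequence of numbers assuming the values $-1$, $0$, $1$. Clearly, $\Phi(0) = 0$. Applying the mean-value theorem, we get
$$\sum_{i=1}^{n} \varepsilon_i\,[\,f_i(x + \Delta x) - f_i(x)\,] = \Phi(1) - \Phi(0) = \Phi'(\theta) = \sum_{i=1}^{n} \varepsilon_i \sum_{j=1}^{n} \frac{\partial f_i(\xi)}{\partial x_j}\,\Delta x_j$$
where $\xi = x + \theta\,\Delta x$ and $0 < \theta < 1$.
From this we have, noting that $|\varepsilon_i| \le 1$,
$$\left|\sum_{i=1}^{n} \varepsilon_i\,[\,f_i(x + \Delta x) - f_i(x)\,]\right| \le \sum_{i=1}^{n} \sum_{j=1}^{n} \left|\frac{\partial f_i(\xi)}{\partial x_j}\right| |\Delta x_j| = \sum_{j=1}^{n} |\Delta x_j| \sum_{i=1}^{n} \left|\frac{\partial f_i(\xi)}{\partial x_j}\right| \tag{2}$$
Since
$$\max_j \sum_{i=1}^{n} \left|\frac{\partial f_i(\xi)}{\partial x_j}\right| = \|f'(\xi)\|_l$$
it follows that, by virtue of inequality (2), we get
$$\left|\sum_{i=1}^{n} \varepsilon_i\,[\,f_i(x + \Delta x) - f_i(x)\,]\right| \le \max_j \sum_{i=1}^{n} \left|\frac{\partial f_i(\xi)}{\partial x_j}\right| \cdot \sum_{j=1}^{n} |\Delta x_j| = \|f'(\xi)\|_l\,\|\Delta x\|_l$$
Assuming, in the last inequality, that
$$\varepsilon_i = \operatorname{sgn}\,[\,f_i(x + \Delta x) - f_i(x)\,] \qquad (i = 1, 2, \ldots, n)$$
we finally get
$$\sum_{i=1}^{n} |f_i(x + \Delta x) - f_i(x)| \le \|f'(\xi)\|_l\,\|\Delta x\|_l$$
Thus,
$$\|f(x + \Delta x) - f(x)\|_l \le \|\Delta x\|_l\,\|f'(\xi)\|_l \tag{2'}$$
which completes the proof.¹⁾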


Theorem. Let the vector function $\varphi(x)$ be continuous together with its derivative $\varphi'(x)$ in a bounded convex closed domain $G$ and
$$\|\varphi'(x)\|_{\mathrm{II}} \le q < 1 \tag{3}$$
where $q$ is a constant. If $x^{(0)} \in G$ and all successive approximations
$$x^{(p+1)} = \varphi(x^{(p)}) \qquad (p = 0, 1, 2, \ldots) \tag{4}$$
also lie in $G$, then the iteration process (4) converges to a unique solution of the equation
$$x = \varphi(x) \tag{5}$$
in $G$.
Proof. We will prove that $y = \varphi(x)$ is a contraction mapping in $G$ in the sense of the l-norm.

Let $x_1, x_2 \in G$ and $y_i = \varphi(x_i)$ ($i = 1, 2$). Using the lemma, we have
$$\|y_1 - y_2\|_l = \|\varphi(x_1) - \varphi(x_2)\|_l \le \|x_1 - x_2\|_l\,\|\varphi'(\xi)\|_l \tag{6}$$
where $\xi \in G$.
Since
$$\|\varphi'(\xi)\|_l \le \max_{x \in G} \|\varphi'(x)\|_l = \|\varphi'(x)\|_{\mathrm{II}} \le q$$
it follows from inequality (6) that
$$\|y_1 - y_2\|_l \le q\,\|x_1 - x_2\|_l$$
where $0 \le q < 1$.
By the Theorem of Sec. 13.10, the proof is complete.


Corollary. The process of iteration (4) converges to the sole solution of equation (5) if the inequalities
$$\sum_{i=1}^{n} \left|\frac{\partial \varphi_i(x)}{\partial x_j}\right| \le q_j < 1 \tag{7}$$
($j = 1, 2, \ldots, n$) hold true for $x \in G$.
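
The two sufficient conditions are independent: a matrix of partial derivatives may satisfy the row-sum condition of Sec. 13.10 and fail the column-sum condition of this section, or the reverse. The following Python sketch illustrates this for a hypothetical constant matrix `J`, which corresponds to a linear mapping $\varphi(x) = Jx + c$ so that the maxima over $G$ are trivial; the function names are illustrative and do not come from the text.

```python
def norm_m(a):
    """m-norm of a matrix: largest row sum of absolute values."""
    return max(sum(abs(v) for v in row) for row in a)


def norm_l(a):
    """l-norm of a matrix: largest column sum of absolute values."""
    columns = range(len(a[0]))
    return max(sum(abs(row[j]) for row in a) for j in columns)


# Hypothetical matrix of partial derivatives [d(phi_i)/d(x_j)].
J = [[0.1, 0.8],
     [0.1, 0.5]]

satisfies_first = norm_m(J) < 1.0    # condition of Sec. 13.10 (row sums)
satisfies_second = norm_l(J) < 1.0   # condition of Sec. 13.11 (column sums)
print(norm_m(J), satisfies_first)
print(norm_l(J), satisfies_second)
```

For this matrix the row sums are 0.9 and 0.6 while the column sums are 0.2 and 1.3, so only the first condition guarantees convergence.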


¹⁾ If we apply the mean-value theorem directly to each component of the vector $f(x + \Delta x) - f(x)$, then the resulting estimate is dependent on the values of the derivative $\dfrac{\partial f_i(\xi_i)}{\partial x_j}$ at various points $\xi_i$ ($i = 1, 2, \ldots, n$) of the interval $(x, x + \Delta x)$. Inequality (2') shows that we can confine ourselves to the values of the derivatives $\dfrac{\partial f_i}{\partial x_j}$ at one and the same point $\xi \in (x, x + \Delta x)$.

Note. On the basis of the theorem of Sec. 13.10, we have, for the approximation $x^{(p)}$, the following estimate
$$\|x^* - x^{(p)}\|_l \le \frac{q^p}{1 - q}\,\|x^{(1)} - x^{(0)}\|_l$$
where $x^{(1)} = \varphi(x^{(0)})$.


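*13.12 THE METHOD OF STEEPEST DESCENT
(GRADIENT METHOD)

Suppose we have a system of equations
$$\begin{aligned} f_1(x_1, x_2, \ldots, x_n) &= 0, \\ f_2(x_1, x_2, \ldots, x_n) &= 0, \\ &\;\;\vdots \\ f_n(x_1, x_2, \ldots, x_n) &= 0 \end{aligned} \tag{1}$$
or, in matrix form,
$$f(x) = 0 \tag{2}$$
where
$$f = \begin{bmatrix} f_1 \\ \vdots \\ f_n \end{bmatrix}, \qquad x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}$$
Let us suppose that the functions $f_i$ are real and continuously differentiable in their common domain of definition. Consider the function
$$U(x) = \sum_{i=1}^{n} [f_i(x)]^2 = (f(x), f(x)) \tag{3}$$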
It is obvious that each solution of system (1) reduces the function $U(x)$ to zero; conversely, the numbers $x_1, x_2, \ldots, x_n$ for which the function $U(x)$ is zero are roots of the system (1).

We will assume that the system (1) has only an isolated solution which is the point of the strict minimum of $U(x)$. The problem thus reduces to finding the minimum of the function $U(x)$ in the n-dimensional space $E_n = \{x_1, x_2, \ldots, x_n\}$.

Let $x$ be a vector root of (1) and let $x^{(0)}$ be its zeroth approximation. Through the point $x^{(0)}$ draw the level surface of the function $U(x)$. If point $x^{(0)}$ is sufficiently close to the root $x$, then, under our assumptions, the level surface
$$U(x) = U(x^{(0)})$$
will resemble an ellipsoid.




Start out from point $x^{(0)}$ and move along the normal to the surface $U(x) = U(x^{(0)})$ until this normal touches some other level surface (Fig. 60)
$$U(x) = U(x^{(1)})$$
at some point $x^{(1)}$.

Then, starting from point $x^{(1)}$, move again along the normal to the level surface $U(x) = U(x^{(1)})$ until this normal touches, at some point $x^{(2)}$, the new level surface
$$U(x) = U(x^{(2)}), \text{ etc.}$$
Since $U(x^{(0)}) > U(x^{(1)}) > U(x^{(2)}) > \ldots$, by following this route we rapidly approach the point with the smallest value of $U$ (the bottom of the "well") which corresponds to the required root $x$ of the system (1). Denote the gradient¹⁾ of the function $U(x)$ by
$$\nabla U(x) = \begin{bmatrix} \dfrac{\partial U}{\partial x_1} \\ \vdots \\ \dfrac{\partial U}{\partial x_n} \end{bmatrix}$$

Fig. 60

From the vector triangles $OM_0M_1$, $OM_1M_2$, ... we conclude that
$$x^{(p+1)} = x^{(p)} - \lambda_p\,\nabla U(x^{(p)}) \qquad (p = 0, 1, 2, \ldots)$$
It remains to determine the factors $\lambda_p$. To do this, consider the scalar function
$$\Phi(\lambda) = U[\,x^{(p)} - \lambda\,\nabla U(x^{(p)})\,]$$


The function $\Phi(\lambda)$ gives the variation of the level of the function $U$ along the appropriate normal to the level surface at the point $x^{(p)}$. The factor $\lambda = \lambda_p$ must be chosen so that $\Phi(\lambda)$ has a minimum. Taking the derivative with respect to $\lambda$ and equating it to zero, we obtain the equation
$$\Phi'(\lambda) = \frac{\partial}{\partial \lambda}\,U[\,x^{(p)} - \lambda\,\nabla U(x^{(p)})\,] = 0 \tag{4}$$

¹⁾ The gradient of the function $U(x)$ (it is denoted by grad $U$ or $\nabla U$, where the symbol $\nabla$ is called nabla or del) is the vector applied to point $x$ and having the direction of the normal $n$ to the level surface of the function at the given point towards increasing $U$ and having length equal to $\dfrac{\partial U}{\partial n}$. The following formula is valid:
$$\nabla U = \frac{\partial U}{\partial x_1}\,e_1 + \frac{\partial U}{\partial x_2}\,e_2 + \ldots + \frac{\partial U}{\partial x_n}\,e_n$$
where $e_i$ ($i = 1, 2, \ldots, n$) are the unit vectors of space $E_n$.


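The least positive root of equation (4) yields the value of $\lambda_p$. Equation (4) must, generally speaking, be solved numerically, and so we give a method for approximating the numbers $\lambda_p$. We will assume that $\lambda$ is a small quantity, the square and higher powers of which may be neglected. We have
$$\Phi(\lambda) = \sum_{i=1}^{n} \{f_i[\,x^{(p)} - \lambda\,\nabla U(x^{(p)})\,]\}^2$$
Expanding the functions $f_i$ in powers of $\lambda$ to within linear terms, we obtain
$$\Phi(\lambda) = \sum_{i=1}^{n} \left[f_i(x^{(p)}) - \lambda\,\frac{\partial f_i(x^{(p)})}{\partial x}\,\nabla U(x^{(p)})\right]^2$$
where
$$\frac{\partial f_i}{\partial x} = \left[\frac{\partial f_i}{\partial x_1}, \frac{\partial f_i}{\partial x_2}, \ldots, \frac{\partial f_i}{\partial x_n}\right]$$
Whence
$$\Phi'(\lambda) = -2\sum_{i=1}^{n} \left[f_i(x^{(p)}) - \lambda\,\frac{\partial f_i(x^{(p)})}{\partial x}\,\nabla U(x^{(p)})\right] \times \frac{\partial f_i(x^{(p)})}{\partial x}\,\nabla U(x^{(p)}) = 0$$
Thus
$$\lambda_p = \frac{\displaystyle\sum_{i=1}^{n} f_i(x^{(p)})\,\frac{\partial f_i(x^{(p)})}{\partial x}\,\nabla U(x^{(p)})}{\displaystyle\sum_{i=1}^{n} \left[\frac{\partial f_i(x^{(p)})}{\partial x}\,\nabla U(x^{(p)})\right]^2} = \frac{(f(x^{(p)}),\ W(x^{(p)})\,\nabla U(x^{(p)}))}{(W(x^{(p)})\,\nabla U(x^{(p)}),\ W(x^{(p)})\,\nabla U(x^{(p)}))}$$
where
$$W(x) = \begin{bmatrix} \dfrac{\partial f_1}{\partial x_1} & \dfrac{\partial f_1}{\partial x_2} & \cdots & \dfrac{\partial f_1}{\partial x_n} \\ \dfrac{\partial f_2}{\partial x_1} & \dfrac{\partial f_2}{\partial x_2} & \cdots & \dfrac{\partial f_2}{\partial x_n} \\ \vdots & \vdots & & \vdots \\ \dfrac{\partial f_n}{\partial x_1} & \dfrac{\partial f_n}{\partial x_2} & \cdots & \dfrac{\partial f_n}{\partial x_n} \end{bmatrix}$$
is the Jacobi matrix of the vector function $f$.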







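Furthermore, we have
$$\frac{\partial U}{\partial x_j} = \frac{\partial}{\partial x_j}\left\{\sum_{i=1}^{n} [f_i(x)]^2\right\} = 2\sum_{i=1}^{n} f_i(x)\,\frac{\partial f_i(x)}{\partial x_j}$$
whence
$$\nabla U(x) = 2\begin{bmatrix} \displaystyle\sum_{i=1}^{n} f_i(x)\,\frac{\partial f_i(x)}{\partial x_1} \\ \vdots \\ \displaystyle\sum_{i=1}^{n} f_i(x)\,\frac{\partial f_i(x)}{\partial x_n} \end{bmatrix} = 2W'(x)\,f(x)$$
where $W'(x)$ is the transpose of the Jacobi matrix.
And so, finally,
$$\mu_p = 2\lambda_p = \frac{(f^{(p)},\ W_pW_p'\,f^{(p)})}{(W_pW_p'\,f^{(p)},\ W_pW_p'\,f^{(p)})} \tag{5}$$
where, for brevity,
$$f^{(p)} = f(x^{(p)}), \qquad W_p = W(x^{(p)})$$
and
$$x^{(p+1)} = x^{(p)} - \mu_p\,W_p'\,f^{(p)} \qquad (p = 0, 1, 2, \ldots) \tag{6}$$
If we assume that the function $f(x)$ is twice continuously differentiable in the neighbourhood of the desired root $x$, then we can obtain more precise formulas for the corrections
$$\Delta x^{(p)} = x^{(p+1)} - x^{(p)} \qquad \text{(see [7])}$$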


Example. Use the method of steepest descent to approximate the roots of the system
$$\begin{aligned} x + x^2 - 2yz &= 0.1, \\ y - y^2 + 3xz &= -0.2, \\ z + z^2 + 2xy &= 0.3 \end{aligned}$$
located in the neighbourhood of the origin.


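Solution. We have
$$x^{(0)} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$
Here
$$f = \begin{bmatrix} x + x^2 - 2yz - 0.1 \\ y - y^2 + 3xz + 0.2 \\ z + z^2 + 2xy - 0.3 \end{bmatrix}$$
and
$$W = \begin{bmatrix} 1 + 2x & -2z & -2y \\ 3z & 1 - 2y & 3x \\ 2y & 2x & 1 + 2z \end{bmatrix}$$
Substituting the zeroth approximation, we obtain
$$f^{(0)} = \begin{bmatrix} -0.1 \\ 0.2 \\ -0.3 \end{bmatrix} \quad \text{and} \quad W_0 = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = E$$
Using formulas (5) and (6), we get the first approximation
$$\mu_0 = \frac{(f^{(0)}, f^{(0)})}{(f^{(0)}, f^{(0)})} = 1$$
and
$$x^{(1)} = x^{(0)} - 1 \cdot E f^{(0)} = \begin{bmatrix} 0.1 \\ -0.2 \\ 0.3 \end{bmatrix}$$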

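Analogously, we find the second approximation $x^{(2)}$. We have
$$f^{(1)} = \begin{bmatrix} 0.13 \\ 0.05 \\ 0.05 \end{bmatrix}, \qquad W_1 = \begin{bmatrix} 1.2 & -0.6 & 0.4 \\ 0.9 & 1.4 & 0.3 \\ -0.4 & 0.2 & 1.6 \end{bmatrix}$$
whence
$$W_1'\,f^{(1)} = \begin{bmatrix} 0.181 \\ 0.002 \\ 0.147 \end{bmatrix}$$
and
$$W_1W_1'\,f^{(1)} = \begin{bmatrix} 0.2748 \\ 0.2098 \\ 0.1632 \end{bmatrix}$$
Hence
$$\mu_1 = \frac{0.13 \cdot 0.2748 + 0.05 \cdot 0.2098 + 0.05 \cdot 0.1632}{0.2748^2 + 0.2098^2 + 0.1632^2} = \frac{0.054374}{0.14619797} = 0.3719$$
and
$$x^{(2)} = \begin{bmatrix} 0.1 \\ -0.2 \\ 0.3 \end{bmatrix} - 0.3719 \begin{bmatrix} 0.181 \\ 0.002 \\ 0.147 \end{bmatrix} = \begin{bmatrix} 0.0327 \\ -0.2007 \\ 0.2453 \end{bmatrix}$$
To check, we compute the residual
$$f^{(2)} = \begin{bmatrix} 0.032 \\ -0.017 \\ -0.007 \end{bmatrix}$$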
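
The two steps of this example can be retraced with a short Python sketch of formulas (5) and (6). The helper names (`jacobi`, `mat_vec`, `descent_step` and so on) are illustrative assumptions rather than notation from the text. Because the sketch carries full floating-point precision, it gives $\mu_1 \approx 0.372$, which agrees with the hand value 0.3719 only to about three decimal places.

```python
def f(v):
    """Left-hand sides of the example system written as f = 0."""
    x, y, z = v
    return [x + x * x - 2 * y * z - 0.1,
            y - y * y + 3 * x * z + 0.2,
            z + z * z + 2 * x * y - 0.3]


def jacobi(v):
    """Jacobi matrix W of the example system."""
    x, y, z = v
    return [[1 + 2 * x, -2 * z, -2 * y],
            [3 * z, 1 - 2 * y, 3 * x],
            [2 * y, 2 * x, 1 + 2 * z]]


def mat_vec(a, u):
    return [sum(aij * uj for aij, uj in zip(row, u)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def descent_step(v):
    """One step of the gradient method: returns (next point, mu_p)."""
    w = jacobi(v)
    fv = f(v)
    g = mat_vec(transpose(w), fv)          # W' f
    h = mat_vec(w, g)                      # W W' f
    mu = dot(fv, h) / dot(h, h)            # formula (5)
    return [a - mu * b for a, b in zip(v, g)], mu   # formula (6)


x1, mu0 = descent_step([0.0, 0.0, 0.0])
x2, mu1 = descent_step(x1)
print(mu0, x1)
print(mu1, x2)
```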
13.13 THE METHOD OF STEEPEST DESCENT FOR THE CASE 
OF A SYSTEM OF LINEAR EQUATIONS 
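Consider the system of linear equations
$$\begin{aligned} f_1 &\equiv \sum_{j=1}^{n} a_{1j}x_j - b_1 = 0, \\ f_2 &\equiv \sum_{j=1}^{n} a_{2j}x_j - b_2 = 0, \\ &\;\;\vdots \\ f_n &\equiv \sum_{j=1}^{n} a_{nj}x_j - b_n = 0 \end{aligned} \tag{1}$$
with real matrix $A = [a_{ij}]$ and the column of constant terms
$$b = \begin{bmatrix} b_1 \\ \vdots \\ b_n \end{bmatrix}$$
Then
$$f = Ax - b$$
and
$$W = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} = A$$
Consequently
$$x^{(p+1)} = x^{(p)} - \mu_p\,A'r_p \tag{2}$$
where $r_p = Ax^{(p)} - b$ is the residual of the vector $x^{(p)}$ and
$$\mu_p = \frac{(r_p,\ AA'r_p)}{(AA'r_p,\ AA'r_p)} \qquad (p = 0, 1, 2, \ldots) \tag{3}$$
(cf. [5], [6]).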
Use of formulas (2) and (3) leads to unwieldy computations. For this reason, one simply uses "descent" instead of "steepest descent" in practical situations, striving towards the minimum of the function
$$U = (Ax - b,\ Ax - b)$$
Here, generally speaking, the number of steps in the process that ensure a given accuracy of the roots of system (1) increases. But it is possible to achieve a situation in which the computation of each step is simpler.

In the general statement of the problem, we assume
$$x^{(p+1)} = x^{(p)} - \lambda_p\,y^{(p)} \qquad (p = 0, 1, 2, \ldots)$$
where $y^{(p)}$ is an arbitrary vector directed to the outside of the level surface $U = \text{const}$ passing through the point $x^{(p)}$, i.e.,
$$(\operatorname{grad} U(x^{(p)}),\ y^{(p)}) > 0$$
We have
$$r_{p+1} = Ax^{(p+1)} - b = Ax^{(p)} - b - \lambda_p\,Ay^{(p)} = r_p - \lambda_p\,Ay^{(p)}$$
A possible way of determining the scalar factor $\lambda_p$ proceeds from the requirement [7]
$$(r_{p+1},\ y^{(p)}) = (r_p,\ y^{(p)}) - \lambda_p\,(Ay^{(p)},\ y^{(p)}) = 0$$
whence
$$\lambda_p = \frac{(r_p,\ y^{(p)})}{(Ay^{(p)},\ y^{(p)})}$$
Depending on the choice of the vector $y^{(p)}$, a variety of computational schemes result. In particular, if the matrix $A = A'$ is positive definite (Sec. 10.15), then, putting $y^{(p)} = r_p$, we have
$$x^{(p+1)} = x^{(p)} - \frac{(r_p,\ r_p)}{(Ar_p,\ r_p)}\,r_p$$
($p = 0, 1, 2, \ldots$), and $(\operatorname{grad} U(x^{(p)}),\ y^{(p)}) = 2\,(Ar_p,\ r_p) > 0$ for $r_p \ne 0$.
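
The scheme with $y^{(p)} = r_p$ for a symmetric positive definite matrix can be sketched as follows. This is an illustration under assumptions: the function name `residual_descent`, the stopping rule on $(r_p, r_p)$, and the 2 by 2 test matrix are hypothetical choices, not taken from the text.

```python
def mat_vec(a, u):
    return [sum(aij * uj for aij, uj in zip(row, u)) for row in a]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual_descent(a, b, x, steps=200, tol=1e-14):
    """x^(p+1) = x^(p) - (r_p, r_p)/(A r_p, r_p) * r_p with r_p = A x^(p) - b."""
    for _ in range(steps):
        r = [s - t for s, t in zip(mat_vec(a, x), b)]
        rr = dot(r, r)
        if rr < tol:                       # residual already negligible
            break
        lam = rr / dot(mat_vec(a, r), r)   # lambda_p for the choice y^(p) = r_p
        x = [xi - lam * ri for xi, ri in zip(x, r)]
    return x


# Hypothetical symmetric positive definite system with solution (1/11, 7/11).
A = [[4.0, 1.0],
     [1.0, 3.0]]
b = [1.0, 2.0]
solution = residual_descent(A, b, [0.0, 0.0])
print(solution)
```

Each step needs only one multiplication by $A$ beyond the residual itself, which is the simplification the text refers to when compared with formulas (2) and (3).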


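Example. Use the method of steepest descent to solve the system of equations
$$\begin{aligned} 8x_1 - x_2 - 2x_3 &= 2.3, \\ 10x_2 + x_3 + 2x_4 &= -0.5, \\ -x_1 + 6x_3 + 2x_4 &= -1.2, \\ 3x_1 - x_2 + 2x_3 + 12x_4 &= 3.7 \end{aligned} \tag{4}$$
Solution. Since the matrix of the system is dominated by the diagonal elements, for the initial vector $x^{(0)}$ we take the vector whose coordinates are rounded values of the roots of the system:
$$8x_1 = 2.3, \qquad 10x_2 = -0.5, \qquad 6x_3 = -1.2, \qquad 12x_4 = 3.7$$
Then, for example,
$$x^{(0)} = \begin{bmatrix} 0.3 \\ -0.05 \\ -0.2 \\ 0.3 \end{bmatrix}$$
Hence
$$r_0 = Ax^{(0)} - b = \begin{bmatrix} 8 & -1 & -2 & 0 \\ 0 & 10 & 1 & 2 \\ -1 & 0 & 6 & 2 \\ 3 & -1 & 2 & 12 \end{bmatrix}\begin{bmatrix} 0.3 \\ -0.05 \\ -0.2 \\ 0.3 \end{bmatrix} - \begin{bmatrix} 2.3 \\ -0.5 \\ -1.2 \\ 3.7 \end{bmatrix} = \begin{bmatrix} 0.55 \\ 0.4 \\ 0.3 \\ 0.45 \end{bmatrix}$$
Furthermore
$$A'r_0 = \begin{bmatrix} 8 & 0 & -1 & 3 \\ -1 & 10 & 0 & -1 \\ -2 & 1 & 6 & 2 \\ 0 & 2 & 2 & 12 \end{bmatrix}\begin{bmatrix} 0.55 \\ 0.4 \\ 0.3 \\ 0.45 \end{bmatrix} = \begin{bmatrix} 5.45 \\ 3.0 \\ 2.0 \\ 6.8 \end{bmatrix}$$
and
$$AA'r_0 = \begin{bmatrix} 8 & -1 & -2 & 0 \\ 0 & 10 & 1 & 2 \\ -1 & 0 & 6 & 2 \\ 3 & -1 & 2 & 12 \end{bmatrix}\begin{bmatrix} 5.45 \\ 3.0 \\ 2.0 \\ 6.8 \end{bmatrix} = \begin{bmatrix} 36.6 \\ 45.6 \\ 20.15 \\ 98.95 \end{bmatrix}$$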

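Using formula (3), we get
$$\mu_0 = \frac{(r_0,\ AA'r_0)}{(AA'r_0,\ AA'r_0)} = \frac{0.55 \cdot 36.6 + 0.4 \cdot 45.6 + 0.3 \cdot 20.15 + 0.45 \cdot 98.95}{36.6^2 + 45.6^2 + 20.15^2 + 98.95^2} = \frac{88.9425}{13616.045} = 0.006532$$
whence
$$x^{(1)} = x^{(0)} - \mu_0\,A'r_0 = \begin{bmatrix} 0.3 \\ -0.05 \\ -0.2 \\ 0.3 \end{bmatrix} - 0.006532 \begin{bmatrix} 5.45 \\ 3.0 \\ 2.0 \\ 6.8 \end{bmatrix} = \begin{bmatrix} 0.2644 \\ -0.0696 \\ -0.2131 \\ 0.2556 \end{bmatrix}$$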
and
$$r_1 = Ax^{(1)} - b = \begin{bmatrix} 0.3109 \\ 0.1020 \\ 0.1684 \\ -0.1966 \end{bmatrix}$$
Subsequent approximations and the corresponding residuals are found in a similar manner:
$$x^{(2)} = \begin{bmatrix} 0.2351 \\ -0.0849 \\ -0.2147 \\ 0.2863 \end{bmatrix}, \qquad r_2 = \begin{bmatrix} 0.0956 \\ 0.0087 \\ 0.2493 \\ 0.0967 \end{bmatrix},$$
$$x^{(3)} = \begin{bmatrix} 0.2296 \\ -0.0842 \\ -0.2251 \\ 0.2748 \end{bmatrix}, \qquad r_3 = \begin{bmatrix} 0.0712 \\ -0.0280 \\ 0.1692 \\ -0.0806 \end{bmatrix},$$
$$x^{(4)} = \begin{bmatrix} 0.2266 \\ -0.0792 \\ -0.2379 \\ 0.2875 \end{bmatrix}, \qquad r_4 = \begin{bmatrix} 0.0680 \\ 0.0354 \\ 0.1211 \\ 0.0334 \end{bmatrix},$$
$$x^{(5)} = \begin{bmatrix} 0.2228 \\ -0.0810 \\ -0.2430 \\ 0.2823 \end{bmatrix}, \qquad r_5 = \begin{bmatrix} 0.0493 \\ 0.0013 \\ 0.0839 \\ -0.0493 \end{bmatrix}$$
and so on.

It will be seen that the approximation process converges slowly in this case: after the fifth approximation, we are still a good distance away from the exact roots of system (4), which are
$x_1 = 0.2$, $x_2 = -0.1$, $x_3 = -0.3$, $x_4 =$
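
The first step of this example can be retraced with a short Python sketch of formulas (2) and (3). The helper names are illustrative assumptions; the matrix `A`, the column `b` and the initial vector are those of system (4).

```python
A = [[8.0, -1.0, -2.0, 0.0],
     [0.0, 10.0, 1.0, 2.0],
     [-1.0, 0.0, 6.0, 2.0],
     [3.0, -1.0, 2.0, 12.0]]
b = [2.3, -0.5, -1.2, 3.7]


def mat_vec(a, u):
    return [sum(aij * uj for aij, uj in zip(row, u)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def dot(u, v):
    return sum(s * t for s, t in zip(u, v))


def steepest_step(a, rhs, x):
    """One step of formulas (2) and (3): returns (next point, mu_p, r_p)."""
    r = [s - t for s, t in zip(mat_vec(a, x), rhs)]    # r_p = A x^(p) - b
    g = mat_vec(transpose(a), r)                       # A' r_p
    h = mat_vec(a, g)                                  # A A' r_p
    mu = dot(r, h) / dot(h, h)                         # formula (3)
    return [xi - mu * gi for xi, gi in zip(x, g)], mu, r   # formula (2)


x1, mu0, r0 = steepest_step(A, b, [0.3, -0.05, -0.2, 0.3])
print(r0)
print(mu0)
print(x1)
```

Calling `steepest_step` repeatedly on its own output continues the process; the slow decrease of the residuals noted in the text can then be observed directly.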


13.14 THE METHOD OF POWER SERIES

Suppose there is a nonlinear system

$$f_k(x_1, x_2, \ldots, x_n) = 0 \tag{1}$$

$(k = 1, 2, \ldots, n)$, where the functions $f_k$ are analytic in the neighbourhood of the isolated solution $x^* = (x_1^*, x_2^*, \ldots, x_n^*)$. Let us consider the more general system [8]

$$F_k(x_1, x_2, \ldots, x_n; \lambda) = 0 \tag{2}$$

$(k = 1, 2, \ldots, n)$ which depends on a real parameter $\lambda$ and is such that for $\lambda = 0$ the system (2) is solved directly, while for $\lambda = 1$ it is identical to system (1), i.e.,

$$F_k(x_1, x_2, \ldots, x_n; 1) \equiv f_k(x_1, x_2, \ldots, x_n)$$

$(k = 1, \ldots, n)$. The parameter $\lambda$ should be introduced so that the dependence of the functions $F_k$ on $\lambda$ is as simple as possible.
For instance, if $x^{(0)} = (x_1^{(0)}, x_2^{(0)}, \ldots, x_n^{(0)})$ is a rough approximation to the solution, then we can set


n 


j=l 
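$(k = 1, 2, \ldots, n)$, where $x = (x_1, x_2, \ldots, x_n)$.

We will assume that $F_k$ are analytic functions of $\lambda$ for $|\lambda| \le 1$. Suppose that for $|\lambda| \le 1$ the system (2) has a simple analytic solution $x_j(\lambda)$ $(j = 1, 2, \ldots, n)$, which for $\lambda = 1$ coincides with $x_j^*$ $(j = 1, 2, \ldots, n)$, and

$$x_j(0) = x_j^{(0)} \quad (j = 1, 2, \ldots, n)$$

where $x_j^{(0)}$ $(j = 1, \ldots, n)$ is a known solution of system (2) for $\lambda = 0$. Expanding the functions $x_j(\lambda)$ in a Taylor series at the point $\lambda = 0$, we obtain

$$x_j(\lambda) = x_j(0) + \lambda x_j'(0) + \frac{\lambda^2}{2!} x_j''(0) + \ldots \quad (j = 1, 2, \ldots, n) \tag{3}$$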


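To determine the coefficients $x_j'(0)$, differentiate (2) with respect to the parameter $\lambda$:

$$\sum_{j=1}^{n} \frac{\partial F_k}{\partial x_j}\, x_j'(\lambda) + \frac{\partial F_k}{\partial \lambda} = 0 \quad (k = 1, 2, \ldots, n) \tag{4}$$

Assuming $x = x^{(0)}$ and $\lambda = 0$, we have

$$\sum_{j=1}^{n} \frac{\partial F_k(x^{(0)}; 0)}{\partial x_j}\, x_j'(0) + \frac{\partial F_k(x^{(0)}; 0)}{\partial \lambda} = 0 \quad (k = 1, 2, \ldots, n)$$

whence, if

$$\det\left[\frac{\partial F_k(x^{(0)}; 0)}{\partial x_j}\right] \ne 0$$

we find $x_j'(0)$.

Furthermore, differentiating (4) with respect to $\lambda$, we obtain

$$\sum_{j=1}^{n} \frac{\partial F_k}{\partial x_j}\, x_j''(\lambda) + \sum_{j=1}^{n} \sum_{l=1}^{n} \frac{\partial^2 F_k}{\partial x_j\, \partial x_l}\, x_j'(\lambda)\, x_l'(\lambda) + 2 \sum_{j=1}^{n} \frac{\partial^2 F_k}{\partial x_j\, \partial \lambda}\, x_j'(\lambda) + \frac{\partial^2 F_k}{\partial \lambda^2} = 0$$

whence, for $x = x^{(0)}$ and $\lambda = 0$, we find

$$\sum_{j=1}^{n} \frac{\partial F_k(x^{(0)}; 0)}{\partial x_j}\, x_j''(0) = -\sum_{j=1}^{n} \sum_{l=1}^{n} \frac{\partial^2 F_k(x^{(0)}; 0)}{\partial x_j\, \partial x_l}\, x_j'(0)\, x_l'(0) - 2 \sum_{j=1}^{n} \frac{\partial^2 F_k(x^{(0)}; 0)}{\partial x_j\, \partial \lambda}\, x_j'(0) - \frac{\partial^2 F_k(x^{(0)}; 0)}{\partial \lambda^2} \quad (k = 1, 2, \ldots, n) \tag{5}$$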
Since $x_j'(0)$ are known, from system (5) it is possible to determine $x_j''(0)$. The derivatives $x_j'''(0)$, $x_j^{\mathrm{IV}}(0)$, ... are computed in a similar manner.

Note that the matrix of coefficients of the higher-order derivatives proves to be the same all the time and to be equal to the Jacobi matrix of the functions $F_1, F_2, \ldots, F_n$ with respect to the variables $x_1, x_2, \ldots, x_n$ for $x_j = x_j^{(0)}$ $(j = 1, 2, \ldots, n)$ and $\lambda = 0$.

Assuming that the series (3) converge for $\lambda = 1$, we finally get

$$x_j^* = x_j(1) = x_j(0) + x_j'(0) + \frac{1}{2!} x_j''(0) + \ldots \quad (j = 1, 2, \ldots, n) \tag{6}$$


A defect of the method is the complexity of the computations 
in the general case of higher-order derivatives. What is more, the 
convergence of series (6) may not be fast enough. 
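To make the procedure concrete, here is a minimal sketch for a single equation ($n = 1$). The function $F(x; \lambda) = x - 1 - 0.1\lambda x^2$ is a hypothetical illustration chosen here, not an example from the text: for $\lambda = 0$ it is solved directly by $x = 1$, and for $\lambda = 1$ it gives the equation $x - 1 - 0.1x^2 = 0$. The partial derivatives are coded by hand under that assumption.

```python
import math

# Hypothetical one-equation illustration (n = 1), not taken from the text:
# F(x; lam) = x - 1 - 0.1*lam*x**2.


def Fx(x, lam):       # dF/dx
    return 1 - 0.2 * lam * x


def Fl(x, lam):       # dF/dlam
    return -0.1 * x * x


def Fxx(x, lam):      # d2F/dx2
    return -0.2 * lam


def Fxl(x, lam):      # d2F/(dx dlam)
    return -0.2 * x


def Fll(x, lam):      # d2F/dlam2
    return 0.0


x0, lam0 = 1.0, 0.0   # known solution of F(x; 0) = 0

# Formula (4) at lam = 0:  Fx * x'(0) + Fl = 0
d1 = -Fl(x0, lam0) / Fx(x0, lam0)

# Formula (5) at lam = 0:  Fx * x''(0) = -(Fxx * x'(0)^2 + 2 * Fxl * x'(0) + Fll)
d2 = -(Fxx(x0, lam0) * d1 ** 2 + 2 * Fxl(x0, lam0) * d1 + Fll(x0, lam0)) / Fx(x0, lam0)

# Series (6) cut off after the quadratic term
approx = x0 + d1 + d2 / 2

# Root of 0.1 x^2 - x + 1 = 0 that continues x = 1, for comparison
exact = (1 - math.sqrt(1 - 0.4)) / 0.2
```

In this sketch the same coefficient `Fx(x0, lam0)` appears in both steps, which is the scalar counterpart of the remark that the matrix of the system is the same Jacobi matrix for every order of derivative.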

When applying the method, one need not assume the functions $x_j(\lambda)$ $(j = 1, 2, \ldots, n)$ to be analytic; namely, in place of the Taylor series, one can take advantage of Taylor's formula, terminating the series $x_j(\lambda)$ with some power $\lambda^p$ and estimating their remainders by familiar formulas (Sec. 3.4).


REFERENCES FOR CHAPTER 13 


[1] L. V. Kantorovich, On Newton's Method, 1949 (in Russian).

[2] Alexander Ostrowski, Sur la convergence et l'estimation des erreurs dans quelques procédés de résolution des équations numériques, 1940.

[3] James B. Scarborough, Numerical Mathematical Analysis, 1955, Chapter IX.

[4] D. A. Ventsel, E. S. Ventsel, Elements of the Theory of Approximate Computations, 1949, Chapter III, Sec. 8 (in Russian).

[5] William E. Milne, Numerical Solution of Differential Equations, 1953, Chapter 9.

[6] Alston S. Householder, Principles of Numerical Analysis, 1953, Chapter 3.

[7] E. But, Numerical Methods, 1959 (in Russian).

[8] Edwin F. Beckenbach (editor), Modern Mathematics for the Engineer, First Series, 1956, Chapter 16, Nonlinear Methods by Charles B. Morrey, Jr.


Chapter 14 
THE INTERPOLATION OF FUNCTIONS 


14.1 FINITE DIFFERENCES OF VARIOUS ORDERS 
Suppose

$$y = f(x)$$

is a given function. We denote by $\Delta x = h$ the fixed value of the increment in the argument (interval or spacing). Then the expression

$$\Delta y \equiv \Delta f(x) = f(x + \Delta x) - f(x) \tag{1}$$

is called the first finite difference of the function $y$. Finite differences of higher orders are defined in similar fashion:

$$\Delta^n y = \Delta(\Delta^{n-1} y) \quad (n = 2, 3, \ldots)$$

For example,

$$\Delta^2 y = \Delta[f(x + \Delta x) - f(x)] = [f(x + 2\Delta x) - f(x + \Delta x)] - [f(x + \Delta x) - f(x)] = f(x + 2\Delta x) - 2f(x + \Delta x) + f(x)$$

Example. Construct finite differences for the function

$$P(x) = x^3$$

taking the interval as $\Delta x = 1$.

Solution. We have

$$\Delta P(x) = (x + 1)^3 - x^3 = 3x^2 + 3x + 1,$$

$$\Delta^2 P(x) = [3(x + 1)^2 + 3(x + 1) + 1] - (3x^2 + 3x + 1) = 6x + 6,$$

$$\Delta^3 P(x) = [6(x + 1) + 6] - (6x + 6) = 6,$$

$$\Delta^n P(x) = 0 \quad \text{for } n > 3$$

Note that the finite difference of order three of the function $P(x)$ is constant.
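The example can be reproduced numerically. The sketch below tabulates $x^3$ at integer points and takes repeated differences; the helper name `differences` is an illustrative choice, not notation from the text.

```python
def differences(values, order):
    """Return the finite differences of the given order of a tabulated sequence."""
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


ys = [x ** 3 for x in range(7)]   # P(x) = x^3 tabulated with step 1, x = 0..6
d1 = differences(ys, 1)           # values of 3x^2 + 3x + 1
d2 = differences(ys, 2)           # values of 6x + 6
d3 = differences(ys, 3)           # constant third difference
d4 = differences(ys, 4)           # differences of order above three vanish
```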




Generally, the following assertion holds true: if

$$P_n(x) = a_0 x^n + a_1 x^{n-1} + \ldots + a_n$$

is an $n$th degree polynomial, then $\Delta^n P_n(x) = n!\, a_0 h^n = \text{const}$, where $\Delta x = h$.

Indeed, we have

$$\Delta P_n(x) = P_n(x + h) - P_n(x) = a_0[(x + h)^n - x^n] + a_1[(x + h)^{n-1} - x^{n-1}] + \ldots + a_{n-1}[(x + h) - x]$$

Expanding the parentheses by the binomial theorem, we readily see that $\Delta P_n(x)$ is a polynomial of degree $n - 1$:

$$\Delta P_n(x) = b_0 x^{n-1} + b_1 x^{n-2} + \ldots + b_{n-1}$$

where

$$b_0 = nha_0$$

Reasoning in a similar manner, we conclude that the second difference $\Delta^2 P_n(x)$ is a polynomial of degree $n - 2$:

$$\Delta^2 P_n(x) = c_0 x^{n-2} + c_1 x^{n-3} + \ldots + c_{n-2}$$

and

$$c_0 = (n - 1) h b_0 = n(n - 1) h^2 a_0$$

Continuing in this manner successively, we finally establish that

$$\Delta^n P_n(x) = n!\, a_0 h^n = \text{constant}$$

As a corollary we obtain

$$\Delta^s P_n(x) = 0 \quad \text{for } s > n$$
The symbol $\Delta$ (delta) may be regarded as an operator that associates the function $\Delta y = f(x + \Delta x) - f(x)$ ($\Delta x$ constant) with the function $y = f(x)$. It is easy to verify the basic properties of the operator $\Delta$:

(1) $\Delta(u + v) = \Delta u + \Delta v$,

(2) $\Delta(Cu) = C\Delta u$ ($C$ constant),

(3) $\Delta^m(\Delta^n y) = \Delta^{m+n} y$

where $m$ and $n$ are nonnegative integers, and $\Delta^0 y = y$ by definition.

From the formula (1) we have

$$f(x + \Delta x) = f(x) + \Delta f(x)$$

whence, regarding $\Delta$ as a symbolic multiplier, we obtain

$$f(x + \Delta x) = (1 + \Delta) f(x) \tag{2}$$

Applying this relation $n$ times in succession, we have

$$f(x + n\Delta x) = (1 + \Delta)^n f(x) \tag{3}$$

Taking advantage of the binomial formula,* we finally derive

$$f(x + n\Delta x) = \sum_{m=0}^{n} C_n^m \Delta^m f(x) \tag{4}$$

where

$$C_n^m = \frac{n(n - 1) \ldots [n - (m - 1)]}{m!}$$

is the number of combinations of $n$ elements taken $m$ at a time.

Thus, with the aid of formula (4) the successive values of the 
function f(x) are expressed in terms of its finite differences of va- 
rious orders. 

Using the identity

$$\Delta = (1 + \Delta) - 1 \tag{5}$$

and applying the binomial theorem, we obtain

$$\Delta^n f(x) = [(1 + \Delta) - 1]^n f(x) = (1 + \Delta)^n f(x) - C_n^1 (1 + \Delta)^{n-1} f(x) + C_n^2 (1 + \Delta)^{n-2} f(x) - \ldots + (-1)^n f(x)$$

whence, by formula (3), we get

$$\Delta^n f(x) = f(x + n\Delta x) - C_n^1 f[x + (n - 1)\Delta x] + C_n^2 f[x + (n - 2)\Delta x] - \ldots + (-1)^n f(x) \tag{6}$$

Formula (6) expresses the $n$th-order finite difference of the function $f(x)$ in terms of successive values of the function.

Suppose $f(x)$ has a continuous derivative $f^{(n)}(x)$ on the interval $[x, x + n\Delta x]$. Then the following important formula holds true:

$$\Delta^n f(x) = (\Delta x)^n f^{(n)}(x + \theta n \Delta x) \tag{7}$$

where

$$0 < \theta < 1$$

* It is left to the reader to substantiate the legitimacy of using the binomial formula.




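The easiest way to prove (7) is by mathematical induction. Indeed, for $n = 1$ we get the mean-value theorem and, hence, formula (7) is true. Now, for $k < n$, let

$$\Delta^k f(x) = (\Delta x)^k f^{(k)}(x + \theta' k \Delta x)$$

where

$$0 < \theta' < 1$$

Then

$$\Delta^{k+1} f(x) = \Delta^k [f(x + \Delta x) - f(x)] = (\Delta x)^k [f^{(k)}(x + \Delta x + \theta' k \Delta x) - f^{(k)}(x + \theta' k \Delta x)]$$

Applying the mean-value theorem to the resulting increment of the derivative $f^{(k)}(x)$, we have

$$\Delta^{k+1} f(x) = (\Delta x)^k\, \Delta x\, f^{(k+1)}(x + \theta' k \Delta x + \theta'' \Delta x)$$

where $0 < \theta'' < 1$. Assuming

$$\frac{\theta' k + \theta''}{k + 1} = \theta \tag{8}$$

we finally get

$$\Delta^{k+1} f(x) = (\Delta x)^{k+1} f^{(k+1)}(x + \theta (k + 1) \Delta x)$$

and, obviously,

$$0 < \theta < 1$$

Thus, the transition from $k$ to $k + 1$ is established and, hence, formula (7) is proved.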
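From formula (7) we have

$$f^{(n)}(x + \theta n \Delta x) = \frac{\Delta^n f(x)}{(\Delta x)^n}$$

Then, passing to the limit as $\Delta x \to 0$ and assuming that the derivative $f^{(n)}(x)$ is continuous, we obtain

$$f^{(n)}(x) = \lim_{\Delta x \to 0} \frac{\Delta^n f(x)}{(\Delta x)^n} \tag{9}$$

Thus, for small $\Delta x$, the approximate formula

$$f^{(n)}(x) \approx \frac{\Delta^n f(x)}{(\Delta x)^n} \tag{10}$$

is valid.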
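Formulas (6) and (10) translate directly into a short computation. The following is a minimal sketch; the function name `nth_difference` and the choice of $f(x) = e^x$ with step $10^{-3}$ are assumptions made here for illustration.

```python
import math


def nth_difference(f, x, h, n):
    """Formula (6): sum over m of (-1)^m C(n, m) f(x + (n - m) h)."""
    return sum((-1) ** m * math.comb(n, m) * f(x + (n - m) * h)
               for m in range(n + 1))


# Exact check on a polynomial: the third difference of x^3 with h = 1 is 3! = 6
third_difference_of_cube = nth_difference(lambda t: t ** 3, 0, 1, 3)

# Formula (10): the second derivative of exp at 0 is 1, so the ratio
# of the second difference to h^2 should be close to 1 for small h
h = 1e-3
approx_second_derivative = nth_difference(math.exp, 0.0, h, 2) / h ** 2
```

For very small $h$ or large $n$ the alternating sum suffers from cancellation in floating-point arithmetic, so the step should not be taken too small in practice.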




14.2 DIFFERENCE TABLE 


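One often has to consider functions $y = f(x)$ specified by tabular values $y_i = f(x_i)$ for a set of equidistant points $x_i$ $(i = 0, 1, 2, \ldots)$, where

$$\Delta x_i = x_{i+1} - x_i = h = \text{constant}$$

The finite differences of the sequence $y_i$ are naturally defined by the relations

$$\Delta y_i = y_{i+1} - y_i,$$

$$\Delta^2 y_i = \Delta(\Delta y_i) = \Delta y_{i+1} - \Delta y_i,$$

$$\ldots$$

$$\Delta^n y_i = \Delta(\Delta^{n-1} y_i) = \Delta^{n-1} y_{i+1} - \Delta^{n-1} y_i$$

From the first equation we have

$$y_{i+1} = y_i + \Delta y_i = (1 + \Delta) y_i$$

whence we derive in succession

$$y_{i+2} = (1 + \Delta) y_{i+1} = (1 + \Delta)^2 y_i,$$

$$y_{i+3} = (1 + \Delta) y_{i+2} = (1 + \Delta)^3 y_i,$$

$$\ldots$$

$$y_{i+n} = (1 + \Delta)^n y_i$$

Using the binomial theorem, we obtain

$$y_{i+n} = y_i + C_n^1 \Delta y_i + C_n^2 \Delta^2 y_i + \ldots + \Delta^n y_i$$

Conversely, we have

$$\Delta^n y_i = [(1 + \Delta) - 1]^n y_i = (1 + \Delta)^n y_i - C_n^1 (1 + \Delta)^{n-1} y_i + C_n^2 (1 + \Delta)^{n-2} y_i - \ldots + (-1)^n y_i$$

or

$$\Delta^n y_i = y_{n+i} - C_n^1 y_{n+i-1} + C_n^2 y_{n+i-2} - \ldots + (-1)^n y_i$$

For example,

$$\Delta^2 y_i = y_{i+2} - 2y_{i+1} + y_i,$$

$$\Delta^3 y_i = y_{i+3} - 3y_{i+2} + 3y_{i+1} - y_i$$

and so on. Note that to compute the $n$th difference $\Delta^n y_i$, one has to know $n + 1$ terms $y_i, y_{i+1}, \ldots, y_{i+n}$ of the given sequence.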




Finite differences of various orders are conveniently arranged in the form of two types of tables: a horizontal difference table (Table 33) or a diagonal difference table (Table 34).


TABLE 33 
HORIZONTAL DIFFERENCE TABLE 





TABLE 34 
DIAGONAL DIFFERENCE TABLE 





Example 1. Form a horizontal difference table for the function

$$y = 2x^3 - 2x^2 + 3x - 1 \tag{1}$$

from the initial value $x_0 = 0$, using $h = 1$ as the interval.

Solution. Assuming $x_0 = 0$, $x_1 = 1$, $x_2 = 2$, we find the corresponding values $y_0 = -1$, $y_1 = 2$, $y_2 = 13$, whence we get

$$\Delta y_0 = y_1 - y_0 = 3,$$

$$\Delta y_1 = y_2 - y_1 = 11,$$

$$\Delta^2 y_0 = \Delta y_1 - \Delta y_0 = 8$$

These values are entered in the table (Table 35). Since our function is a polynomial of degree three, the third difference is constant (see Sec. 14.1) and equal to

$$\Delta^3 y_i = 2 \cdot 3! = 12$$

And so the rest of the table (Table 35) can be filled in by means of summation, using the formulas

$$\Delta^2 y_{i+1} = \Delta^2 y_i + 12 \quad (i = 0, 1, 2, \ldots),$$

$$\Delta y_{i+1} = \Delta y_i + \Delta^2 y_i \quad (i = 1, 2, \ldots),$$

$$y_{i+1} = y_i + \Delta y_i \quad (i = 2, 3, \ldots)$$

TABLE 35
HORIZONTAL DIFFERENCE TABLE OF THE CUBIC FUNCTION (1)

The stepwise broken line indicates the initial data for compiling the table.
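Filling in the table by summation can be sketched as follows. The starting entries $y_0 = -1$, $\Delta y_0 = 3$, $\Delta^2 y_0 = 8$ and the constant third difference 12 are those of the example; the function name `fill_cubic_table` and the number of rows are illustrative assumptions.

```python
def fill_cubic_table(y0, dy0, d2y0, d3, rows):
    """Extend a horizontal difference table of a cubic using additions only.

    Each row holds (y_i, Δy_i, Δ²y_i, Δ³y_i); the third difference d3 is constant.
    """
    table = []
    y, dy, d2y = y0, dy0, d2y0
    for _ in range(rows):
        table.append((y, dy, d2y, d3))
        # y_{i+1} = y_i + Δy_i,  Δy_{i+1} = Δy_i + Δ²y_i,  Δ²y_{i+1} = Δ²y_i + Δ³y
        y, dy, d2y = y + dy, dy + d2y, d2y + d3
    return table


table = fill_cubic_table(-1, 3, 8, 12, 6)
ys = [row[0] for row in table]   # tabulated values y_0, ..., y_5
```

The values obtained in this way can be compared with direct evaluation of $2x^3 - 2x^2 + 3x - 1$ at $x = 0, 1, \ldots, 5$ as a check.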


Note. Random errors of the computor may crop up when compiling a difference table. Let us see how the error $\varepsilon$ in the value of $y_n$ affects the values of the differences. Compiling the appropriate diagonal difference table, we get Table 36.

TABLE 36
DIAGONAL DIFFERENCE TABLE OF THE CUBIC FUNCTION (1)
(only the entries affected by the error are shown)

 x      y           Δy              Δ²y               Δ³y
                                                      Δ³y_{n-3} + ε
                                    Δ²y_{n-2} + ε
                    Δy_{n-1} + ε                      Δ³y_{n-2} - 3ε
 x_n    y_n + ε                     Δ²y_{n-1} - 2ε
                    Δy_n - ε                          Δ³y_{n-1} + 3ε
                                    Δ²y_n + ε
                                                      Δ³y_n - ε

From Table 36 it is seen that: (1) if $y_n$ contains an error, then so also do the differences

$$\Delta y_{n-1},\ \Delta y_n;\quad \Delta^2 y_{n-2},\ \Delta^2 y_{n-1},\ \Delta^2 y_n$$

and so on; (2) errors enter the $k$th differences $\Delta^k y$ with binomial coefficients of alternating sign, namely, the errors have, respectively, the values

$$C_k^0 \varepsilon,\ -C_k^1 \varepsilon,\ C_k^2 \varepsilon,\ \ldots,\ (-1)^k C_k^k \varepsilon$$

and, consequently, the absolute value of the maximum error of the $k$th difference grows rapidly with the number of the difference; (3) for each finite difference $\Delta^k y$, the sum of the errors (with regard for sign) is zero, while the sum of the absolute values of the errors is equal to $|\varepsilon| \cdot 2^k$. Thus, even a slight error in the value of the function leads to considerable errors in high-order differences. Note that in the case of a diagonal difference table the maximum error of the differences $\Delta^k y$ lies in the same horizontal row as the erroneous tabular value $y_n$, or in the adjacent upper and lower rows.

This propagation law of the e-error in difference tables makes it 
possible in certain cases to detect and locate an error and also to 
find its numerical value, thus enabling one to rectify it. 
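The propagation law is easy to observe numerically. Since differencing is linear, the error part of every difference equals the corresponding difference of a sequence that is zero everywhere except for the error itself. The sketch below uses that observation; the sequence length, the position of the error and the helper name `differences` are illustrative assumptions.

```python
from math import comb


def differences(values, order):
    """Return the finite differences of the given order of a tabulated sequence."""
    for _ in range(order):
        values = [b - a for a, b in zip(values, values[1:])]
    return values


eps = 1
errors = [0] * 11
errors[5] = eps            # a single error eps in the value y_5

k = 3
dk = differences(errors, k)             # error part of the k-th differences
nonzero = [d for d in dk if d != 0]     # affected entries, in order of increasing index
```

For $k = 3$ the affected entries are $\varepsilon, -3\varepsilon, 3\varepsilon, -\varepsilon$, in agreement with the binomial pattern stated above.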

Difference tables are ordinarily compiled to an accuracy of some 
fixed decimal place. If the function y=f(x) has continuous deri- 
vatives up to the mth order, then, given a sufficiently small in- 
terval h= Ax, its differences up to the mth order inclusive vary 
smoothly, and the mth difference is nearly constant within the 
limits of the given decimal places. Any violation of this condition 
in some section of a table generally indicates a computational er- 
ror (if the function is without singularities). 

Having established the maximum deviation of the mth difference 
from the norm, one can locate the error in the column of values 
of the function y on the assumption that (1) the error is single 
and consists in an erroneous computation of one value of the fun- 
ction, and (2) no new errors were made in computing the finite 
differences. If such an error is detected in a difference table, it 
can be rectified with the aid of difference values. We show how 
this is done and for the sake of simplicity confine ourselves to 
the case of constancy of second or third differences. 

Suppose the erroneous tabular value is $y_n + \varepsilon$, where the subscript $n$ has been established and the magnitude of the error $\varepsilon$ is unknown.

If the third differences are practically constant, then the second differences form an arithmetic progression, and for this reason the true value of the second difference $\Delta^2 y_{n-1}$ will be equal to the arithmetic mean of three successive erroneous differences:

$$\Delta^2 y_{n-1} = \frac{1}{3}\left[(\Delta^2 y_{n-2} + \varepsilon) + (\Delta^2 y_{n-1} - 2\varepsilon) + (\Delta^2 y_n + \varepsilon)\right]$$

since the terms in $\varepsilon$ cancel out.

Using the true value of the second difference $\Delta^2 y_{n-1}$ thus found, it is possible to find the magnitude of the error $\varepsilon$; namely, this error is equal to one-half the difference between the corrected and erroneous values of the difference $\Delta^2 y_{n-1}$:

$$\varepsilon = \frac{1}{2}\left[\Delta^2 y_{n-1} - (\Delta^2 y_{n-1} - 2\varepsilon)\right]$$

Then the true value of the function $y_n$ itself can be found from the identity

$$y_n = (y_n + \varepsilon) - \varepsilon$$

As a check, compute the differences once again.




Example 2. Correct the error in Table 37. 


TABLE 37
DIFFERENCE TABLE CONTAINING AN ERROR

 x     y            Δy        Δ²y        Error
 15    13.260
                    884
 16    14.144                 0
                    884
 17    15.028                 0
                    884
 18    15.912                 (-4) 0     ε
                    88(0)4
 19    16.79(2)6              (8) 0      -2ε
                    88(8)4
 20    17.680                 (-4) 0     ε
                    884
 21    18.564                 0
                    884
 22    19.448                 0
                    884
 23    20.332


Solution. Here, the smooth course of the second differences becomes most irregular for $x = 19$. The error affects three rows indicated by the brace. Let us find the arithmetic mean of the second difference for the middle one of the three rows:

$$\Delta^2 y_{n-1} = \frac{10^{-3}}{3}(-4 + 8 - 4) = 0$$

whence

$$\varepsilon = \frac{1}{2}[0 - 0.008] = -0.004$$

Correcting the tabular value of $y$ for $x = 19$, we get

$$y_n = (y_n + \varepsilon) - \varepsilon = 16.792 - (-0.004) = 16.796$$


After the correction, we get a table with regular variation of the first differences and a constant second difference (the erroneous digits are given in brackets). Bear in mind that this method is good only for correcting isolated computational errors or copying errors. There are special "smoothing" techniques [1] designed to eliminate large numbers of errors that may appear for a variety of reasons, and also to diminish any accumulation of errors resulting from inaccuracies of the computational methods themselves and from rounding off intermediate results to a given number of decimal places.
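The correction procedure of Example 2 can be sketched in a few lines. The tabular values of Table 37 are entered in units of 0.001 so that integer arithmetic is exact; the variable names, and the assumption that the suspect entry is not at the edge of the table, are choices made here for illustration.

```python
# Tabular values of Table 37 in units of 0.001 (x = 15, 16, ..., 23),
# including the erroneous entry 16.792 at x = 19.
ys = [13260, 14144, 15028, 15912, 16792, 17680, 18564, 19448, 20332]

d1 = [b - a for a, b in zip(ys, ys[1:])]
d2 = [b - a for a, b in zip(d1, d1[1:])]   # d2[k] = y[k+2] - 2*y[k+1] + y[k]

# The most irregular second difference is centred on the suspect value.
# Assumption: the error is single and not in the first or last rows.
k = max(range(len(d2)), key=lambda i: abs(d2[i]))
n = k + 1                                   # index of the suspect value y_n

# True second difference = mean of the three erroneous ones (eps cancels out)
true_d2 = (d2[k - 1] + d2[k] + d2[k + 1]) / 3

# eps is one-half the difference between the corrected and erroneous values
eps = (true_d2 - d2[k]) / 2
corrected = ys[n] - eps
```

With these data the suspect index corresponds to $x = 19$, and the corrected value is 16796 in units of 0.001, that is, 16.796.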


14.3 GENERALIZED POWER

In the sequel we will need a concept which we will call a generalized power (see [1] where the term 'factorial' is used for this notion).

Definition. The generalized $n$th power of a number $x$ is a product of $n$ factors, the first of which is equal to $x$, and each subsequent one is $h$ less than the preceding:

$$x^{[n]} = x(x - h)(x - 2h) \ldots [x - (n - 1)h] \tag{1}$$

where $h$ is some fixed constant.

The exponent of a generalized power is enclosed in square brackets. It is agreed that $x^{[0]} = 1$.

For $h = 0$, the generalized power (1) coincides with the ordinary power:

$$x^{[n]} = x^n$$

Let us compute the differences for a generalized power, assuming $\Delta x = h$. For the first difference we have

$$\begin{aligned}
\Delta x^{[n]} &= (x + h)^{[n]} - x^{[n]} \\
&= (x + h)x \ldots [x - (n - 2)h] - x(x - h) \ldots [x - (n - 1)h] \\
&= x(x - h) \ldots [x - (n - 2)h]\,\{(x + h) - [x - (n - 1)h]\} \\
&= x(x - h) \ldots [x - (n - 2)h]\, nh = nhx^{[n-1]}
\end{aligned}$$

that is

$$\Delta x^{[n]} = nhx^{[n-1]} \tag{2}$$

Compute the second difference:

$$\Delta^2 x^{[n]} = \Delta(\Delta x^{[n]}) = \Delta(nhx^{[n-1]}) = nh \cdot (n - 1)hx^{[n-2]} = n(n - 1)h^2 x^{[n-2]}$$

Thus

$$\Delta^2 x^{[n]} = n(n - 1)h^2 x^{[n-2]}$$

Using the method of mathematical induction it is easy to prove the general formula

$$\Delta^k x^{[n]} = n(n - 1) \ldots [n - (k - 1)]\, h^k x^{[n-k]}$$

where $k = 1, 2, \ldots, n$.

It is obvious that

$$\Delta^k x^{[n]} = 0 \quad \text{for } k > n$$




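From formula (2) there also follows a simple formula for finite summation. Let

$$x_0, x_1, \ldots, x_N$$

be equally spaced points with interval $h$:

$$x_{i+1} - x_i = h \quad (i = 0, 1, \ldots, N - 1)$$

Consider the sum

$$S_N = \sum_{i=0}^{N-1} x_i^{[n]}$$

Since, by formula (2), we have

$$x^{[n]} = \frac{\Delta x^{[n+1]}}{h(n + 1)}$$

it follows that

$$\begin{aligned}
S_N &= \frac{1}{h(n + 1)} \sum_{i=0}^{N-1} \Delta x_i^{[n+1]} \\
&= \frac{1}{h(n + 1)} \left[(x_1^{[n+1]} - x_0^{[n+1]}) + (x_2^{[n+1]} - x_1^{[n+1]}) + \ldots + (x_N^{[n+1]} - x_{N-1}^{[n+1]})\right] \\
&= \frac{x_N^{[n+1]} - x_0^{[n+1]}}{h(n + 1)}
\end{aligned}$$

Thus

$$\sum_{i=0}^{N-1} x_i^{[n]} = \frac{x_N^{[n+1]} - x_0^{[n+1]}}{h(n + 1)} \tag{3}$$

Formula (3) is similar to the Newton-Leibniz formula for a positive integral power.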
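Both the difference formula (2) and the summation formula (3) can be checked exactly with rational arithmetic. In the sketch below the function name `gen_power`, the step $h = 1/2$, the starting point $x_0 = 2$ and the values of $n$ and $N$ are arbitrary illustrative choices.

```python
from fractions import Fraction


def gen_power(x, n, h):
    """Generalized power x^[n] = x (x - h) ... (x - (n - 1) h), with x^[0] = 1."""
    result = Fraction(1)
    for j in range(n):
        result *= x - j * h
    return result


h = Fraction(1, 2)
n, N = 3, 10
xs = [Fraction(2) + i * h for i in range(N + 1)]   # x_0, x_1, ..., x_N

# Formula (2): first difference of the generalized power at x_0
delta = gen_power(xs[0] + h, n, h) - gen_power(xs[0], n, h)

# Formula (3): finite sum versus the closed form
lhs = sum(gen_power(x, n, h) for x in xs[:N])
rhs = (gen_power(xs[N], n + 1, h) - gen_power(xs[0], n + 1, h)) / ((n + 1) * h)
```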


14.4 STATEMENT OF THE PROBLEM OF INTERPOLATION 


The simplest problem of interpolation [2] consists in the following. On an interval $[a, b]$ are specified $n + 1$ points $x_0, x_1, \ldots, x_n$, called mesh points (interpolation points), and the values of some function $f(x)$ at these points:

$$f(x_0) = y_0, \quad f(x_1) = y_1, \quad \ldots, \quad f(x_n) = y_n \tag{1}$$

It is required to construct a function $F(x)$ (interpolating function) belonging to a known class and assuming the same values at the interpolation points as $f(x)$, that is, such that

$$F(x_0) = y_0, \quad F(x_1) = y_1, \quad \ldots, \quad F(x_n) = y_n \tag{2}$$

Geometrically, this means that one has to find a curve $y = F(x)$ of some specific type that passes through the given set of points $M_i(x_i, y_i)$ $(i = 0, 1, 2, \ldots, n)$ (Fig. 61).


Fig. 61 





In such a general statement, the problem can have an infinity of solutions or none at all. However, the problem becomes unambiguous if in place of the arbitrary function $F(x)$ we seek a polynomial $P_n(x)$ of degree not higher than $n$ that satisfies the conditions (2); that is, such that

$$P_n(x_0) = y_0, \quad P_n(x_1) = y_1, \quad \ldots, \quad P_n(x_n) = y_n$$

The resulting interpolation formula

$$y = F(x)$$

is ordinarily used to approximate the values of the given function $f(x)$ for values of the argument $x$ that differ from the interpolation points. This operation is called interpolation of the function $f(x)$. We distinguish between interpolation in the narrow sense, when $x \in [x_0, x_n]$, that is, the value of $x$ is intermediate between $x_0$ and $x_n$, and extrapolation, when $x \notin [x_0, x_n]$. In the sequel we will use the term interpolation to mean both interpolation in the narrow sense and extrapolation.


14.5 NEWTON’S FIRST INTERPOLATION FORMULA 


Suppose we have a function $y = f(x)$ and are given the values $y_i = f(x_i)$ for equally spaced values of the independent variable: $x_i = x_0 + ih$ $(i = 0, 1, 2, \ldots, n)$, where $h$ is the spacing (interval). It is required to find a polynomial $P_n(x)$ of degree not higher than $n$ and assuming at points $x_i$ the values

$$P_n(x_i) = y_i \quad (i = 0, 1, \ldots, n) \tag{1}$$

Conditions (1) are equivalent to

$$\Delta^m P_n(x_0) = \Delta^m y_0$$

for $m = 0, 1, 2, \ldots, n$.

Following Newton, we seek the polynomial in the form

$$P_n(x) = a_0 + a_1(x - x_0) + a_2(x - x_0)(x - x_1) + a_3(x - x_0)(x - x_1)(x - x_2) + \ldots + a_n(x - x_0)(x - x_1) \ldots (x - x_{n-1}) \tag{2}$$

Using the generalized power, we write expression (2) as

$$P_n(x) = a_0 + a_1(x - x_0)^{[1]} + a_2(x - x_0)^{[2]} + a_3(x - x_0)^{[3]} + \ldots + a_n(x - x_0)^{[n]} \tag{2'}$$

Our problem consists in determining the coefficients $a_i$ $(i = 0, 1, 2, \ldots, n)$ of the polynomial $P_n(x)$. Setting $x = x_0$ in (2'), we obtain

$$P_n(x_0) = y_0 = a_0$$

To find the coefficient $a_1$, form the first difference

$$\Delta P_n(x) = a_1 h + 2a_2(x - x_0)^{[1]} h + 3a_3(x - x_0)^{[2]} h + \ldots + na_n(x - x_0)^{[n-1]} h$$

Putting $x = x_0$ in this expression, we get

$$\Delta P_n(x_0) = \Delta y_0 = a_1 h$$

whence

$$a_1 = \frac{\Delta y_0}{1!\, h}$$

To determine the coefficient $a_2$, form the second difference

$$\Delta^2 P_n(x) = 2!\, h^2 a_2 + 2 \cdot 3 h^2 a_3 (x - x_0)^{[1]} + \ldots + (n - 1)n h^2 a_n (x - x_0)^{[n-2]}$$

Putting $x = x_0$, we get

$$\Delta^2 P_n(x_0) = \Delta^2 y_0 = 2!\, h^2 a_2$$

whence

$$a_2 = \frac{\Delta^2 y_0}{2!\, h^2}$$

Continuing this process successively, we find that

$$a_i = \frac{\Delta^i y_0}{i!\, h^i} \quad (i = 0, 1, 2, \ldots, n)$$

where $0! = 1$ and $\Delta^0 y = y$.


Substituting the values of the coefficients $a_i$ thus found into expression (2'), we get Newton's interpolation polynomial

$$P_n(x) = y_0 + \frac{\Delta y_0}{1!\, h}(x - x_0)^{[1]} + \frac{\Delta^2 y_0}{2!\, h^2}(x - x_0)^{[2]} + \ldots + \frac{\Delta^n y_0}{n!\, h^n}(x - x_0)^{[n]} \tag{3}$$

It is easy to see that the polynomial (3) fully satisfies the requirements of the problem. Indeed, firstly, the degree of the polynomial $P_n(x)$ does not exceed $n$, secondly,

$$P_n(x_0) = y_0$$

and

$$\begin{aligned}
P_n(x_k) &= y_0 + \frac{\Delta y_0}{h}(x_k - x_0) + \frac{\Delta^2 y_0}{2!\, h^2}(x_k - x_0)(x_k - x_1) + \ldots + \frac{\Delta^k y_0}{k!\, h^k}(x_k - x_0)(x_k - x_1) \ldots (x_k - x_{k-1}) \\
&= y_0 + k\Delta y_0 + \frac{k(k - 1)}{2!}\Delta^2 y_0 + \ldots + \frac{k(k - 1) \ldots 1}{k!}\Delta^k y_0 \\
&= (1 + \Delta)^k y_0 = y_k \quad (k = 1, 2, \ldots, n)
\end{aligned}$$


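Note that, as $h \to 0$, formula (3) becomes a Taylor polynomial for the function $y$. Indeed,

$$\lim_{h \to 0} \frac{\Delta^k y_0}{h^k} = \left(\frac{d^k y}{dx^k}\right)_{x = x_0} = y^{(k)}(x_0)$$

Besides, clearly,

$$\lim_{h \to 0} (x - x_0)^{[n]} = (x - x_0)^n$$

whence, as $h \to 0$, formula (3) assumes the aspect of the Taylor polynomial:

$$P_n(x) = y(x_0) + y'(x_0)(x - x_0) + \frac{y''(x_0)}{2!}(x - x_0)^2 + \ldots + \frac{y^{(n)}(x_0)}{n!}(x - x_0)^n$$

For practical use, the Newton interpolation formula (3) is ordinarily written in a modified form. For this purpose we introduce a new variable $q$ via the formula

$$q = \frac{x - x_0}{h}$$

Then

$$\frac{(x - x_0)^{[i]}}{h^i} = \frac{x - x_0}{h} \cdot \frac{x - x_0 - h}{h} \cdot \frac{x - x_0 - 2h}{h} \ldots \frac{x - x_0 - (i - 1)h}{h} = q(q - 1)(q - 2) \ldots (q - i + 1)$$

$(i = 1, 2, \ldots, n)$.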


Putting these expressions into (3), we get

$$P_n(x) = y_0 + q\Delta y_0 + \frac{q(q - 1)}{2!}\Delta^2 y_0 + \ldots + \frac{q(q - 1) \ldots (q - n + 1)}{n!}\Delta^n y_0 \tag{4}$$

where $q = \dfrac{x - x_0}{h}$ is the number of steps needed to reach point $x$ proceeding from point $x_0$. This is the final form of Newton's first interpolation formula.

Formula (4) is conveniently used for interpolating a function $y = f(x)$ in the neighbourhood of the initial value $x_0$, where $q$ is small in absolute value.

If we put $n = 1$ in (4), then we get a formula for linear interpolation:

$$P_1(x) = y_0 + q\Delta y_0$$

For $n = 2$ we have the formula for parabolic interpolation:

$$P_2(x) = y_0 + q\Delta y_0 + \frac{q(q - 1)}{2}\Delta^2 y_0$$

If we have an unlimited table of values of the function $y$, then the number $n$ in the interpolation formula (4) may be arbitrary. In practice, the number $n$ in that case is chosen so that the difference $\Delta^n y_i$ is a constant to a specified degree of accuracy. Any tabular value of the argument $x$ can serve as the initial value $x_0$.

If the table of values of the function is finite, then the number $n$ is bounded, namely: $n$ cannot be greater than the number of values of the function $y$ diminished by unity.

Note that when using Newton's first interpolation formula, it is convenient to use a horizontal difference table, for then the required values of the differences of the function are found in the appropriate horizontal row of the table.
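Formula (4) can be sketched as a short routine that builds the leading differences $\Delta^i y_0$ on the fly. The function name `newton_forward`, the evaluation point 3.52 and the use of three-decimal values of $e^x$ at $x = 3.50, 3.55, \ldots, 3.70$ as test data are assumptions made for this illustration.

```python
import math


def newton_forward(x0, h, ys, x, n):
    """Evaluate formula (4) with the differences Δ^i y_0, i = 0..n.

    ys holds y_0, y_1, ... at equally spaced points; n must not exceed len(ys) - 1.
    """
    diffs = list(ys)
    q = (x - x0) / h
    total, coeff = 0.0, 1.0           # coeff = q(q-1)...(q-i+1) / i!
    for i in range(n + 1):
        total += coeff * diffs[0]     # diffs[0] is Δ^i y_0 at this stage
        coeff *= (q - i) / (i + 1)
        diffs = [b - a for a, b in zip(diffs, diffs[1:])]
    return total


ys = [33.115, 34.813, 36.598, 38.475, 40.447]   # e^x at 3.50, 3.55, ..., 3.70
value = newton_forward(3.50, 0.05, ys, 3.52, 3)
reference = math.exp(3.52)
```

With three-decimal tabular data the interpolated value cannot be expected to agree with `reference` beyond roughly the third decimal place.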


Example 1. Using a spacing of $h = 0.05$, construct Newton's interpolation polynomial on the interval $[3.5, 3.6]$ for the function $y = e^x$ given by the table

 x    3.50      3.55      3.60      3.65      3.70
 y    33.115    34.813    36.598    38.475    40.447

Solution. Form a difference table (Table 38). Note that, by accepted practice, in the difference columns we do not indicate decimal orders (which are clear from the column of values of the function). Since the third differences are practically constant, we put $n = 3$ in (4). Taking $x_0 = 3.50$, $y_0 = 33.115$, we will have

$$P_3(x) = 33.115 + 1.698q + 0.087\,\frac{q(q - 1)}{2} + 0.005\,\frac{q(q - 1)(q - 2)}{6}$$

or

$$P_3(x) = 33.115 + 1.698q + 0.0435q(q - 1) + 0.00083q(q - 1)(q - 2)$$

where

$$q = \frac{x - 3.50}{0.05} = 20(x - 3.5)$$

TABLE 38
DIFFERENCE TABLE FOR y = e^x

 x       y         Δy      Δ²y    Δ³y
 3.50    33.115    1698    87     5
 3.55    34.813    1785    92     3
 3.60    36.598    1877    95
 3.65    38.475    1972
 3.70    40.447













Example 2. Table 39 contains the values of the probability integral

$$\Phi(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-x^2}\, dx$$

Applying Newton's first interpolation formula, find, approximately, $\Phi(1.43)$.

Solution. We extend Table 39 by adding differences of the function $y$ up to order three inclusive.

For $x_0$ we take the tabular value closest to the desired value $x = 1.43$; i.e., we put $x_0 = 1.4$. Since $h = 0.1$, then

$$q = \frac{1.43 - 1.4}{0.1} = 0.3$$












TABLE 39
DIFFERENCE TABLE FOR y = Φ(x)

 x      y         Δy     Δ²y    Δ³y
 1.0    0.8427    375    -74    10
 1.1    0.8802    301    -64    10
 1.2    0.9103    237    -54    9
 1.3    0.9340    183    -45    9
 1.4    0.9523    138    -36    9
 1.5    0.9661    102    -27    5
 1.6    0.9763    75     -22    6
 1.7    0.9838    53     -16    4
 1.8    0.9891    37     -12
 1.9    0.9928    25
 2.0    0.9953


Substituting into (4), we obtain

$$\Phi(1.43) \approx 0.9523 + 0.3 \cdot 0.0138 + \frac{0.3(0.3 - 1)}{2!}(-0.0036) + \frac{0.3(0.3 - 1)(0.3 - 2)}{3!} \cdot 0.0009 = 0.95686$$

(Tabular value: $\Phi(1.43) = 0.9569$; see Jahnke and Emde's "Funktionentafeln".)

In practical situations it is often necessary, for a function given in tabular form, to find an analytic formula representing the tabular values of the function to a certain degree of accuracy. This is called an empirical formula, and the problem of constructing it is ambiguous.

When constructing an empirical formula, one has to take into account the general properties of the function. If it is found, from a difference table, that the $n$th differences of the function are constant for equally spaced values of the argument, then one can take the corresponding first interpolation formula of Newton for the empirical formula.


Example 3, Construct an empirical formula for the function y spe- 
cified by the table 





Solution. Forming the difference table (Table 40), we see that the second difference is constant.

TABLE 40
FINITE DIFFERENCES OF THE FUNCTION

Using Newton's interpolation formula in the form (3), and noting that $h = 1$, we have

$$y = 5.2 + 2.8x - \frac{0.4}{2}\,x(x - 1)$$

or

$$y = 5.2 + 3x - 0.2x^2$$
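Example 4. Find the sum of the squares

$$S_n = 1^2 + 2^2 + \ldots + n^2$$

of the natural numbers from 1 to $n$.

Solution. We clearly have

$$\Delta S_n = S_{n+1} - S_n = (n + 1)^2$$

whence

$$\Delta^2 S_n = 2n + 3, \quad \Delta^3 S_n = 2$$

and, thus, $S_n$ may be sought as a third-degree polynomial in $n$. To determine the differences

$$\Delta S_1, \quad \Delta^2 S_1$$

we have to compute three values: $S_1$, $S_2$, and $S_3$. We have

$$S_1 = 1,$$

$$S_2 = S_1 + 2^2 = 1 + 4 = 5,$$

$$S_3 = S_2 + 3^2 = 5 + 9 = 14$$

and from this

$$\Delta S_1 = 5 - 1 = 4,$$

$$\Delta S_2 = 14 - 5 = 9,$$

$$\Delta^2 S_1 = 9 - 4 = 5$$

and

$$\Delta^3 S_1 = 2$$

Using Newton's first interpolation formula and noting that the initial value of the argument is 1 and $h = 1$, so that $q = n - 1$, we get

$$S_n = 1 + 4(n - 1) + \frac{5}{2!}(n - 1)(n - 2) + \frac{2}{3!}(n - 1)(n - 2)(n - 3)$$

or

$$S_n = \frac{1}{6}\,n(n + 1)(2n + 1)$$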
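The result of Example 4 can be verified with exact rational arithmetic. The sketch below evaluates Newton's first formula with the differences found in the solution ($S_1 = 1$, $\Delta S_1 = 4$, $\Delta^2 S_1 = 5$, $\Delta^3 S_1 = 2$) and $q = n - 1$; the function name `newton_sum_of_squares` is an illustrative choice.

```python
from fractions import Fraction


def newton_sum_of_squares(n):
    """S_n from Newton's first formula with initial argument 1, h = 1, q = n - 1."""
    q = Fraction(n - 1)
    return (1
            + 4 * q
            + Fraction(5, 2) * q * (q - 1)
            + Fraction(2, 6) * q * (q - 1) * (q - 2))


# Compare with direct summation and with the closed form n(n+1)(2n+1)/6
checks = [(n,
           newton_sum_of_squares(n),
           sum(k * k for k in range(1, n + 1)),
           Fraction(n * (n + 1) * (2 * n + 1), 6))
          for n in range(1, 11)]
```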


14.6 NEWTON’S SECOND INTERPOLATION FORMULA 


Newton’s first interpolation formula is inconvenient for interpo- 
lating functions near the end of a table. In this case, one ordi- 
narily takes advantage of Newton’s second interpolation formula, 
which we now derive. 

Suppose we have a set of values of the function

$$y_i = y(x_i) \quad (i = 0, 1, 2, \ldots, n)$$

for equally spaced values of the argument

$$x_i = x_0 + ih$$

We construct an interpolating polynomial of the following form:

$$P_n(x) = a_0 + a_1(x - x_n) + a_2(x - x_n)(x - x_{n-1}) + a_3(x - x_n)(x - x_{n-1})(x - x_{n-2}) + \ldots + a_n(x - x_n)(x - x_{n-1}) \ldots (x - x_1)$$

or, using the generalized power, we obtain

$$P_n(x) = a_0 + a_1(x - x_n)^{[1]} + a_2(x - x_{n-1})^{[2]} + a_3(x - x_{n-2})^{[3]} + \ldots + a_n(x - x_1)^{[n]} \tag{1}$$

Our problem is to determine the coefficients $a_0, a_1, a_2, \ldots, a_n$ so that the equations

$$P_n(x_i) = y_i \quad (i = 0, 1, 2, \ldots, n)$$

hold true. To do this, it is necessary and sufficient that

$$\Delta^i P_n(x_{n-i}) = \Delta^i y_{n-i} \quad (i = 0, 1, \ldots, n) \tag{2}$$

Setting $x = x_n$ in formula (1), we have

$$P_n(x_n) = y_n = a_0$$


and, hence,

$$a_0 = y_n$$

Now, taking first differences of the left and right members of (1), we have

$$\Delta P_n(x) = a_1 \cdot 1h + a_2 \cdot 2h(x - x_{n-1})^{[1]} + a_3 \cdot 3h(x - x_{n-2})^{[2]} + \ldots + a_n nh(x - x_1)^{[n-1]}$$

From this, setting $x = x_{n-1}$ and having regard for relation (2), we get

$$\Delta P_n(x_{n-1}) = \Delta y_{n-1} = a_1 h$$

Hence

$$a_1 = \frac{\Delta y_{n-1}}{1!\, h}$$

Similarly, forming the second difference of $P_n(x)$, we get

$$\Delta^2 P_n(x) = a_2 \cdot 2!\, h^2 + a_3 \cdot 3 \cdot 2h^2(x - x_{n-2})^{[1]} + \ldots + a_n n(n - 1)h^2(x - x_1)^{[n-2]}$$

Assuming $x = x_{n-2}$, we find

$$\Delta^2 P_n(x_{n-2}) = \Delta^2 y_{n-2} = a_2 \cdot 2!\, h^2$$

and, thus,

$$a_2 = \frac{\Delta^2 y_{n-2}}{2!\, h^2}$$

The type of regularity of the coefficients $a_i$ is clear enough. Applying the method of mathematical induction, it can be demonstrated rigorously that

$$a_i = \frac{\Delta^i y_{n-i}}{i!\, h^i} \quad (i = 0, 1, 2, \ldots, n) \tag{3}$$


BGR Ga 46 Se Eaa 

o t B atn) (Fe) RAM) Eee 
vee HEO (xn). (8X3) (4) 
Formula (4) is called Newton’s second interpolation formula. We 


introduce a more convenient form of (4). Let 
X— Xp 


I= 











then 
oe Zaz EPE 


(iunt 
pre =q+ 32, etc. 


528 Ch. 14. The Interpolation of Functions 


Substituting these values into (4), we get 
Pi (2) = Yn + GAY, - pe ay + 


4 LEE OTA ary, +... - Ue data=) any, (4") 


ni 


This ts the commonly used form of Néwton’s second interpola- 
tion formula. For approximate computations of the values of a 
function y, we put 


| y= P, (x) 
Example 1. Given a seven-place log table of values of $y = \log_{10} x$:

 x       y
 1000    3.0000000
 1010    3.0043214
 1020    3.0086002
 1030    3.0128372
 1040    3.0170333
 1050    3.0211893

Find $\log_{10} 1044$.

Solution. Form the table of differences (Table 41).

TABLE 41
FINITE DIFFERENCES OF THE FUNCTION y = log_10 x

 x       y            Δy       Δ²y     Δ³y
 1000    3.0000000    43214    -426    8
 1010    3.0043214    42788    -418    9
 1020    3.0086002    42370    -409    8
 1030    3.0128372    41961    -401
 1040    3.0170333    41560
 1050    3.0211893

Assume

$$x_n = 1050$$

Then

$$q = \frac{x - x_n}{h} = \frac{1044 - 1050}{10} = -0.6$$

Using the underlined differences, we have, by virtue of (4'),

$$\log_{10} 1044 = 3.0211893 + (-0.6) \cdot 0.0041560 + \frac{(-0.6)(-0.6 + 1)}{2}(-0.0000401) + \frac{(-0.6)(-0.6 + 1)(-0.6 + 2)}{6} \cdot 0.0000008 = 3.0187005$$

The result is correct to all the digits written.
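Formula (4') can be sketched in the same way as the first formula, taking at each stage the last entry of the current difference column. The function name `newton_backward` is an assumption made here; the data are the seven-place logarithms of the example.

```python
import math


def newton_backward(xn, h, ys, x, n):
    """Evaluate formula (4') with the differences Δ^i y_{n-i}, i = 0..n.

    ys holds y_0, ..., y_n at equally spaced points ending at xn.
    """
    diffs = list(ys)
    q = (x - xn) / h
    total, coeff = 0.0, 1.0           # coeff = q(q+1)...(q+i-1) / i!
    for i in range(n + 1):
        total += coeff * diffs[-1]    # last entry is Δ^i y_{n-i} at this stage
        coeff *= (q + i) / (i + 1)
        diffs = [b - a for a, b in zip(diffs, diffs[1:])]
    return total


ys = [3.0000000, 3.0043214, 3.0086002, 3.0128372, 3.0170333, 3.0211893]
value = newton_backward(1050, 10, ys, 1044, 3)
reference = math.log10(1044)
```

Rounded to seven decimal places, `value` reproduces the figure 3.0187005 obtained in the example.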

Either the first or the second interpolation formula of Newton can be used for extrapolating the function, that is, for finding the values of the function $y$ for values of the argument $x$ lying beyond the range of the table. If $x < x_0$ and $x$ is close to $x_0$, then it is best to use Newton's first interpolation formula; in this case

$$q = \frac{x - x_0}{h} < 0$$

But if $x > x_n$ and $x$ is close to $x_n$, then it is more convenient to use Newton's second interpolation formula, noting that

$$q = \frac{x - x_n}{h} > 0$$

Thus, Newton's first interpolation formula is ordinarily used for forward interpolation and backward extrapolation, while Newton's second interpolation formula is used for backward interpolation and forward extrapolation.

Note that, generally speaking, extrapolation is less exact than 
the operation of interpolation in the narrow sense. 


Example 2. Having a table of values of the function $y = \sin x$ from $x = 15°$ to $x = 55°$ at $h = 5°$ intervals, find $\sin 14°$ and $\sin 56°$.

TABLE 42
DIFFERENCE TABLE FOR y = sin x

 x      y         Δy     Δ²y    Δ³y
 15°    0.2588    832    -26    -6
 20°    0.3420    806    -32    -6
 25°    0.4226    774    -38    -6
 30°    0.5000    736    -44    -5
 35°    0.5736    692    -49    -5
 40°    0.6428    643    -54    -3
 45°    0.7071    589    -57
 50°    0.7660    532
 55°    0.8192

Solution. Form the table of differences (Table 42). We see that the third differences of the function $y$ are practically constant and so we can confine ourselves to them.

To compute $\sin 14°$ we assume

$$x_0 = 15° \quad \text{and} \quad x = 14°$$

whence

$$q = \frac{14° - 15°}{5°} = -0.2$$

Applying Newton's first interpolation formula and using the underlined differences, we get

$$\sin 14° = 0.2588 + (-0.2) \cdot 0.0832 + \frac{(-0.2)(-1.2)}{2!}(-0.0026) + \frac{(-0.2)(-1.2)(-2.2)}{3!}(-0.0006) = 0.2419$$

Tables give $\sin 14° = 0.24192$.

To find $\sin 56°$ we put

$$x_n = 55° \quad \text{and} \quad x = 56°$$

whence

$$q = \frac{56° - 55°}{5°} = 0.2$$

Applying Newton's second interpolation formula and using the twice underlined differences, we get

$$\sin 56° = 0.8192 + 0.2 \cdot 0.0532 + \frac{0.2 \cdot 1.2}{2!}(-0.0057) + \frac{0.2 \cdot 1.2 \cdot 2.2}{3!}(-0.0003) = 0.8291$$

Tables give $\sin 56° = 0.82904$.


14.7 TABLE OF CENTRAL DIFFERENCES 


In the construction of Newton's interpolation formulas, only those values of a function were used which lie on one side of the chosen initial value; these formulas are thus of a one-sided nature.

In many cases, interpolation formulas that contain both preceding and following values of the function with respect to the initial value are very useful. The ones most often used are those which contain differences located in a horizontal row (of a diagonal difference table of the given function) corresponding to the initial values $x_0$ and $y_0$ or in the rows immediately adjacent to it. These differences $\Delta y_{-1}$, $\Delta y_0$, $\Delta^2 y_{-1}$, ... are called central differences (Table 43), where

$$x_i = x_0 + ih \quad (i = 0, \pm 1, \pm 2, \ldots), \quad y_i = f(x_i),$$

$$\Delta y_i = y_{i+1} - y_i, \quad \Delta^2 y_i = \Delta y_{i+1} - \Delta y_i, \quad \text{etc.}$$

TABLE 43

The corresponding interpolation formulas are called central-difference formulas and include the formulas of Gauss, Stirling, and Bessel [3].


14.8 GAUSSIAN INTERPOLATION FORMULAS 
Let us first derive the Gaussian interpolation formulas.
Suppose we have $2n + 1$ equally spaced points
$$x_{-n},\ x_{-(n-1)},\ \ldots,\ x_{-1},\ x_0,\ x_1,\ \ldots,\ x_{n-1},\ x_n$$
where
$$\Delta x_i = x_{i+1} - x_i = h = \text{constant} \quad (i = -n, -(n-1), \ldots, n-1)$$
and we know the values of the function $y = f(x)$ at these points:
$$y_i = f(x_i) \quad (i = 0, \pm 1, \ldots, \pm n)$$
It is required to construct a polynomial $P(x)$ of degree not exceeding
$2n$ such that
$$P(x_i) = y_i \quad \text{for } i = 0, \pm 1, \ldots, \pm n$$
From this condition it follows that
$$\Delta^k P(x_i) = \Delta^k y_i \qquad (1)$$
for all corresponding values of $i$ and $k$.
We will seek this polynomial in the form
$$\begin{aligned}
P(x) = {} & a_0 + a_1(x - x_0) + a_2(x - x_0)(x - x_1) + a_3(x - x_{-1})(x - x_0)(x - x_1) \\
& + a_4(x - x_{-1})(x - x_0)(x - x_1)(x - x_2) \\
& + a_5(x - x_{-2})(x - x_{-1})(x - x_0)(x - x_1)(x - x_2) + \ldots \\
& + a_{2n-1}(x - x_{-(n-1)}) \ldots (x - x_{-1})(x - x_0)(x - x_1) \ldots (x - x_{n-1}) \\
& + a_{2n}(x - x_{-(n-1)}) \ldots (x - x_{-1})(x - x_0)(x - x_1) \ldots (x - x_{n-1})(x - x_n)
\end{aligned} \qquad (2)$$
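Introducing generalized powers, we obtain
$$\begin{aligned}
P(x) = {} & a_0 + a_1(x - x_0)^{[1]} + a_2(x - x_0)^{[2]} + a_3(x - x_{-1})^{[3]} + a_4(x - x_{-1})^{[4]} + \ldots \\
& + a_{2n-1}(x - x_{-(n-1)})^{[2n-1]} + a_{2n}(x - x_{-(n-1)})^{[2n]}
\end{aligned} \qquad (3)$$
If, in computing the coefficients $a_i$ $(i = 0, 1, 2, \ldots, 2n)$, we apply
the same technique as in the derivation of the Newton interpolation
formulas and take into account (1), we then get successively
$$a_0 = y_0, \quad a_1 = \frac{\Delta y_0}{1!\,h}, \quad a_2 = \frac{\Delta^2 y_{-1}}{2!\,h^2}, \quad a_3 = \frac{\Delta^3 y_{-1}}{3!\,h^3}, \quad a_4 = \frac{\Delta^4 y_{-2}}{4!\,h^4}, \quad \ldots,$$
$$a_{2n-1} = \frac{\Delta^{2n-1} y_{-(n-1)}}{(2n-1)!\,h^{2n-1}}, \quad a_{2n} = \frac{\Delta^{2n} y_{-n}}{(2n)!\,h^{2n}}$$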

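Furthermore, introducing the variable
$$q = \frac{x - x_0}{h}$$
and making an appropriate substitution in formula (3), we get Gauss'
first interpolation formula:
$$\begin{aligned}
P(x) = {} & y_0 + q\,\Delta y_0 + \frac{q(q-1)}{2!}\Delta^2 y_{-1} + \frac{(q+1)q(q-1)}{3!}\Delta^3 y_{-1} \\
& + \frac{(q+1)q(q-1)(q-2)}{4!}\Delta^4 y_{-2} + \frac{(q+2)(q+1)q(q-1)(q-2)}{5!}\Delta^5 y_{-2} + \ldots \\
& + \frac{(q+n-1)\ldots(q-n+1)}{(2n-1)!}\Delta^{2n-1} y_{-(n-1)} + \frac{(q+n-1)\ldots(q-n)}{(2n)!}\Delta^{2n} y_{-n}
\end{aligned} \qquad (4)$$
or, more briefly,
$$\begin{aligned}
P(x) = {} & y_0 + q\,\Delta y_0 + \frac{q^{[2]}}{2!}\Delta^2 y_{-1} + \frac{(q+1)^{[3]}}{3!}\Delta^3 y_{-1} + \frac{(q+1)^{[4]}}{4!}\Delta^4 y_{-2} + \ldots \\
& + \frac{(q+n-1)^{[2n-1]}}{(2n-1)!}\Delta^{2n-1} y_{-(n-1)} + \frac{(q+n-1)^{[2n]}}{(2n)!}\Delta^{2n} y_{-n}
\end{aligned} \qquad (4')$$
where $x = x_0 + qh$ and $q^{[m]} = q(q-1)\ldots[q-(m-1)]$.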
The first interpolation formula of Gauss contains the central
differences
$$\Delta y_0,\ \Delta^2 y_{-1},\ \Delta^3 y_{-1},\ \Delta^4 y_{-2},\ \Delta^5 y_{-2},\ \Delta^6 y_{-3},\ \ldots$$
(see Table 43, where these differences form the lower broken row
indicated by the arrows). In a similar manner we can obtain Gauss'
second interpolation formula, which contains the central differences
$$\Delta y_{-1},\ \Delta^2 y_{-1},\ \Delta^3 y_{-2},\ \Delta^4 y_{-2},\ \Delta^5 y_{-3},\ \Delta^6 y_{-3},\ \ldots$$
(in Table 43 these differences form the upper broken row indicated
by the arrows).
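Gauss' second interpolation formula is of the form
$$\begin{aligned}
P(x) = {} & y_0 + q\,\Delta y_{-1} + \frac{(q+1)q}{2!}\Delta^2 y_{-1} + \frac{(q+1)q(q-1)}{3!}\Delta^3 y_{-2} \\
& + \frac{(q+2)(q+1)q(q-1)}{4!}\Delta^4 y_{-2} + \ldots \\
& + \frac{(q+n-1)\ldots(q-n+1)}{(2n-1)!}\Delta^{2n-1} y_{-n} + \frac{(q+n)(q+n-1)\ldots(q-n+1)}{(2n)!}\Delta^{2n} y_{-n}
\end{aligned} \qquad (5)$$
or, more compactly,
$$\begin{aligned}
P(x) = {} & y_0 + q\,\Delta y_{-1} + \frac{(q+1)^{[2]}}{2!}\Delta^2 y_{-1} + \frac{(q+1)^{[3]}}{3!}\Delta^3 y_{-2} + \frac{(q+2)^{[4]}}{4!}\Delta^4 y_{-2} + \ldots \\
& + \frac{(q+n-1)^{[2n-1]}}{(2n-1)!}\Delta^{2n-1} y_{-n} + \frac{(q+n)^{[2n]}}{(2n)!}\Delta^{2n} y_{-n}
\end{aligned} \qquad (5')$$
where
$$x = x_0 + qh$$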
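
As an illustration of how the differences of the lower broken row enter Gauss' first formula, here is a small Python sketch. It is not taken from the book: the function name `gauss_forward`, the argument `center` (the table index of $y_0$) and the choice of test data (the sine table of Sec. 14.6, Example 2, with $x_0 = 35^\circ$) are assumptions made for the example.

```python
from math import factorial

def gauss_forward(x0, h, y, center, x, order):
    """Gauss' first formula truncated after differences of the given order.

    y      -- tabulated values at equally spaced points
    center -- index in y of the initial value y_0
    """
    # d[k][i] is the k-th forward difference starting at table index i.
    d = [list(y)]
    for _ in range(order):
        prev = d[-1]
        d.append([b - a for a, b in zip(prev, prev[1:])])
    q = (x - x0) / h
    total, coeff = 0.0, 1.0
    for k in range(order + 1):
        # The k-th term uses the difference Δ^k y_{-floor(k/2)}.
        total += coeff * d[k][center - k // 2] / factorial(k)
        # Successive factors of the numerator: q, q-1, q+1, q-2, q+2, ...
        coeff *= (q + k // 2) if k % 2 == 0 else (q - (k + 1) // 2)
    return total

sines = [0.2588, 0.3420, 0.4226, 0.5000, 0.5736, 0.6428, 0.7071, 0.7660, 0.8192]
# x_0 = 35° is the fifth table entry (index 4); interpolate at 37°.
sin37 = gauss_forward(35.0, 5.0, sines, center=4, x=37.0, order=4)
print(round(sin37, 4))
```

With four-place data the result agrees with the tabular value $\sin 37^\circ \approx 0.6018$ to within the rounding of the table.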


14.9 STIRLING’S INTERPOLATION FORMULA


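Taking the arithmetic mean of the first and second interpolation
formulas of Gauss (4) and (5) of Sec. 14.8, we get Stirling's formula
$$\begin{aligned}
P(x) = {} & y_0 + q\,\frac{\Delta y_{-1} + \Delta y_0}{2} + \frac{q^2}{2}\Delta^2 y_{-1} + \frac{q(q^2 - 1^2)}{3!}\cdot\frac{\Delta^3 y_{-2} + \Delta^3 y_{-1}}{2} \\
& + \frac{q^2(q^2 - 1^2)}{4!}\Delta^4 y_{-2} + \frac{q(q^2 - 1^2)(q^2 - 2^2)}{5!}\cdot\frac{\Delta^5 y_{-3} + \Delta^5 y_{-2}}{2} \\
& + \frac{q^2(q^2 - 1^2)(q^2 - 2^2)}{6!}\Delta^6 y_{-3} + \ldots \\
& + \frac{q(q^2 - 1^2)(q^2 - 2^2)\ldots[q^2 - (n-1)^2]}{(2n-1)!}\cdot\frac{\Delta^{2n-1} y_{-n} + \Delta^{2n-1} y_{-(n-1)}}{2} \\
& + \frac{q^2(q^2 - 1^2)(q^2 - 2^2)\ldots[q^2 - (n-1)^2]}{(2n)!}\Delta^{2n} y_{-n}
\end{aligned}$$
where
$$q = \frac{x - x_0}{h}$$
It is easy to see that
$$P(x_i) = y_i \quad \text{for } i = 0, \pm 1, \ldots, \pm n$$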


14.10 BESSEL’S INTERPOLATION FORMULA


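In addition to Stirling's formula, frequent use is made of Bessel's
formula. To derive this formula we take advantage of Gauss'
second interpolation formula (5) (see Sec. 14.8).

We take $2n + 2$ equally spaced points
$$x_{-n},\ x_{-(n-1)},\ \ldots,\ x_0,\ \ldots,\ x_{n-1},\ x_n,\ x_{n+1}$$
with spacing $h$; let
$$y_i = f(x_i) \quad (i = -n, \ldots, n+1)$$
be the given values of the function $y = f(x)$.
If we take $x = x_0$ and $y = y_0$ for the initial values, then, using
the points $x_k$ $(k = 0, \pm 1, \ldots, \pm n)$, we have
$$\begin{aligned}
P(x) = {} & y_0 + q\,\Delta y_{-1} + \frac{(q+1)q}{2!}\Delta^2 y_{-1} + \frac{(q+1)q(q-1)}{3!}\Delta^3 y_{-2} + \ldots \\
& + \frac{(q+n-1)\ldots(q-n+1)}{(2n-1)!}\Delta^{2n-1} y_{-n} + \frac{(q+n)(q+n-1)\ldots(q-n+1)}{(2n)!}\Delta^{2n} y_{-n}
\end{aligned} \qquad (1)$$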
Now take $x = x_1$ and $y = y_1$ for the initial values and use the
points $x_{1+k}$ $(k = 0, \pm 1, \ldots, \pm n)$. Then
$$\frac{x - x_1}{h} = \frac{x - x_0 - h}{h} = q - 1$$
and, correspondingly, the indices of all the differences in the right
member of (1) will increase by unity. Replacing $q$ by $q - 1$ in the
right member of (1) and increasing the indices of all differences
by 1, we get an auxiliary interpolation formula:
$$\begin{aligned}
P(x) = {} & y_1 + (q-1)\Delta y_0 + \frac{q(q-1)}{2!}\Delta^2 y_0 + \frac{q(q-1)(q-2)}{3!}\Delta^3 y_{-1} \\
& + \frac{(q+1)q(q-1)(q-2)}{4!}\Delta^4 y_{-1} + \frac{(q+1)q(q-1)(q-2)(q-3)}{5!}\Delta^5 y_{-2} + \ldots \\
& + \frac{(q+n-2)\ldots(q-n)}{(2n-1)!}\Delta^{2n-1} y_{-(n-1)} + \frac{(q+n-1)\ldots(q-n)}{(2n)!}\Delta^{2n} y_{-(n-1)}
\end{aligned} \qquad (2)$$

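Taking the arithmetic mean of (2) and (4) of Sec. 14.8, we get
(after some simple manipulations) the Bessel interpolation formula
$$\begin{aligned}
P(x) = {} & \frac{y_0 + y_1}{2} + \left(q - \frac12\right)\Delta y_0 + \frac{q(q-1)}{2!}\cdot\frac{\Delta^2 y_{-1} + \Delta^2 y_0}{2} \\
& + \frac{\left(q - \frac12\right)q(q-1)}{3!}\Delta^3 y_{-1} + \frac{q(q-1)(q+1)(q-2)}{4!}\cdot\frac{\Delta^4 y_{-2} + \Delta^4 y_{-1}}{2} \\
& + \frac{\left(q - \frac12\right)q(q-1)(q+1)(q-2)}{5!}\Delta^5 y_{-2} \\
& + \frac{q(q-1)(q+1)(q-2)(q+2)(q-3)}{6!}\cdot\frac{\Delta^6 y_{-3} + \Delta^6 y_{-2}}{2} + \ldots \\
& + \frac{q(q-1)(q+1)(q-2)(q+2)\ldots(q-n)(q+n-1)}{(2n)!}\cdot\frac{\Delta^{2n} y_{-n} + \Delta^{2n} y_{-(n-1)}}{2} \\
& + \frac{\left(q - \frac12\right)q(q-1)(q+1)(q-2)(q+2)\ldots(q-n)(q+n-1)}{(2n+1)!}\Delta^{2n+1} y_{-n}
\end{aligned} \qquad (3)$$
where
$$q = \frac{x - x_0}{h}$$
The Bessel interpolation formula (3), as follows from its derivation,
is a polynomial that coincides with the given function
$y = f(x)$ at the $2n + 2$ points
$$x_{-n},\ x_{-(n-1)},\ \ldots,\ x_n,\ x_{n+1}$$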
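In the particular case of $n = 1$, we have, neglecting the difference
$\Delta^3 y_{-1}$, Bessel's formula for parabolic interpolation:
$$P(x) = \frac{y_0 + y_1}{2} + \left(q - \frac12\right)\Delta y_0 + \frac{q(q-1)}{2}\cdot\frac{\Delta y_0 - \Delta y_{-1} + \Delta y_1 - \Delta y_0}{2}$$
or
$$P(x) = y_0 + q\,\Delta y_0 - q_1(\Delta y_1 - \Delta y_{-1})$$
where
$$q_1 = \frac{q(1 - q)}{4}$$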


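In Bessel's formula all terms containing odd differences have
the multiplier $q - \frac12$; therefore, when $q = \frac12$, formula (3) is
substantially simplified:
$$\begin{aligned}
P\left(\frac{x_0 + x_1}{2}\right) = {} & \frac{y_0 + y_1}{2} - \frac18\cdot\frac{\Delta^2 y_{-1} + \Delta^2 y_0}{2} + \frac{3}{128}\cdot\frac{\Delta^4 y_{-2} + \Delta^4 y_{-1}}{2} \\
& - \frac{5}{1024}\cdot\frac{\Delta^6 y_{-3} + \Delta^6 y_{-2}}{2} + \ldots \\
& + (-1)^n\frac{[1\cdot 3\cdot 5\ldots(2n-1)]^2}{2^{2n}(2n)!}\cdot\frac{\Delta^{2n} y_{-n} + \Delta^{2n} y_{-(n-1)}}{2}
\end{aligned}$$
This special case of the Bessel formula is called the formula for
interpolating to halves. If in the Bessel formula (3) we change
the variable using the formula $q - \frac12 = p$, then it takes on a more
symmetric aspect:
$$\begin{aligned}
P(x) = {} & \frac{y_0 + y_1}{2} + p\,\Delta y_0 + \frac{p^2 - \frac14}{2!}\cdot\frac{\Delta^2 y_{-1} + \Delta^2 y_0}{2} + \frac{p\left(p^2 - \frac14\right)}{3!}\Delta^3 y_{-1} \\
& + \frac{\left(p^2 - \frac14\right)\left(p^2 - \frac94\right)}{4!}\cdot\frac{\Delta^4 y_{-2} + \Delta^4 y_{-1}}{2} + \frac{p\left(p^2 - \frac14\right)\left(p^2 - \frac94\right)}{5!}\Delta^5 y_{-2} + \ldots \\
& + \frac{\left(p^2 - \frac14\right)\left(p^2 - \frac94\right)\ldots\left[p^2 - \frac{(2n-1)^2}{4}\right]}{(2n)!}\cdot\frac{\Delta^{2n} y_{-n} + \Delta^{2n} y_{-(n-1)}}{2} \\
& + \frac{p\left(p^2 - \frac14\right)\left(p^2 - \frac94\right)\ldots\left[p^2 - \frac{(2n-1)^2}{4}\right]}{(2n+1)!}\Delta^{2n+1} y_{-n}
\end{aligned} \qquad (3')$$
where $p = \dfrac{1}{h}\left(x - \dfrac{x_0 + x_1}{2}\right)$.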


14.11 GENERAL DESCRIPTION OF INTERPOLATION 
FORMULAS WITH CONSTANT INTERVAL 


In a general characterization of interpolation formulas, note
the following: when constructing the Newton interpolation formulas,
either the first or the last interpolation point is chosen as
the initial value $x_0$; for central formulas of interpolation, a
midpoint is taken for the initial value. The scheme given below
(Table 44) illustrates the differences used in the basic interpolation
formulas; for convenience in surveying the table, the numbering
of the indices has been changed in Newton's
second interpolation formula.


TABLE 44
[Scheme of a diagonal difference table marking the differences that enter
Newton's 1st formula, Newton's 2nd formula, Stirling's formula, and
Bessel's formula.]











A more detailed examination of the interpolation formulas shows
that it is advisable to use Stirling's formula for $|q| \le 0.25$ and
Bessel's formula for $0.25 \le q \le 0.75$. The first and second Newton
interpolation formulas are used to advantage when the interpolation
is performed at the beginning or, respectively, at the end of
a table and the needed central differences are lacking [4].

Example 1. The values of the probability integral [3]
$$\Phi(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-x^2}\,dx$$
are given in Table 45. Find $\Phi(0.5437)$.

Solution. We supplement Table 45 with finite differences of the
given function $y = \Phi(x)$. Taking $x_0 = 0.54$ and $x = 0.5437$, we have
$$q = \frac{x - x_0}{h} = \frac{0.5437 - 0.54}{0.01} = 0.37$$
Since $\frac14 < q < \frac34$, we use Bessel's formula (3') to get
$$p = q - \frac12 = 0.37 - 0.50 = -0.13$$


TABLE 45
TABLE OF DIFFERENCES OF THE FUNCTION y = Φ(x)
(differences in units of 10^-7)

  x        y          Δy      Δ²y    Δ³y
 0.51   0.5292437   86550    -896    -7
 0.52   0.5378987   85654    -903    -7
 0.53   0.5464641   84751    -910    -7
 0.54   0.5549392   83841    -917    -6
 0.55   0.5633233   82924    -923
 0.56   0.5716157   82001
 0.57   0.5798158

The underlined differences are Δy = 83841 (row 0.54), Δ²y = -910 and
Δ³y = -7 (row 0.53), and Δ²y = -917 (row 0.54).

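whence, using the underlined differences, we obtain
$$\begin{aligned}
\Phi(0.5437) = {} & \frac{0.5549392 + 0.5633233}{2} + (-0.13)\cdot 0.0083841 \\
& + \frac{0.0169 - 0.25}{2}\cdot\frac{-0.0000910 - 0.0000917}{2} \\
& + \frac{(-0.13)(0.0169 - 0.25)}{6}\cdot(-0.0000007) \\
= {} & 0.55913125 - 0.00108993 + 0.00001065 = 0.5580520
\end{aligned}$$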
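
The same computation can be reproduced with a short Python sketch of Bessel's formula in the variable $p$. This is only an illustration under stated assumptions: the function name `bessel` and the argument `center` (the table index of $y_0$) are not from the book, and the comparison with `math.erf` relies on the fact that the tabulated integral coincides with the error function.

```python
from math import erf, factorial

def bessel(x0, h, y, center, x, order):
    """Bessel's formula (3') truncated after differences of the given order."""
    d = [list(y)]
    for _ in range(order):
        prev = d[-1]
        d.append([b - a for a, b in zip(prev, prev[1:])])
    p = (x - x0) / h - 0.5
    total = 0.0
    even_coeff = 1.0   # running product (p^2 - 1/4)(p^2 - 9/4)...
    for k in range(order + 1):
        j = center - k // 2
        if k % 2 == 0:
            if k > 0:
                m = k - 1
                even_coeff *= p * p - m * m / 4.0
            # Even orders use the mean of two adjacent differences.
            mean = 0.5 * (d[k][j] + d[k][j + 1])
            total += even_coeff * mean / factorial(k)
        else:
            # Odd orders carry the extra factor p.
            total += p * even_coeff * d[k][j] / factorial(k)
    return total

phi = [0.5292437, 0.5378987, 0.5464641, 0.5549392,
       0.5633233, 0.5716157, 0.5798158]      # x = 0.51, 0.52, ..., 0.57
value = bessel(0.54, 0.01, phi, center=3, x=0.5437, order=3)
print(round(value, 7), round(erf(0.5437), 7))
```

The interpolated value agrees with the result obtained above to seven decimal places.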


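Example 2. Given Table 46 of the values of the complete elliptic
integral
$$K(\alpha) = \int_0^{\pi/2} \frac{dx}{\sqrt{1 - \sin^2\alpha\,\sin^2 x}}$$
find $K(78^\circ 30')$.

Solution. Put $x_0 = 78^\circ$, $h = 1^\circ$, $x = 78^\circ 30'$, whence $q = 0.5$. If we
take advantage of Bessel's formula for interpolating to halves,
then, confining ourselves to fifth differences, we have
$$\begin{aligned}
K(78^\circ 30') = {} & 2.97857 + 0.5\cdot 8316\cdot 10^{-5} - 0.125\cdot\frac{715 + 850}{2}\cdot 10^{-5} + 0.023437\cdot\frac{32 + 40}{2}\cdot 10^{-5} \\
= {} & 2.97857 + 0.04158 - 0.000978 + 0.000008 = 3.019180
\end{aligned}$$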


TABLE 46
VALUES OF THE COMPLETE ELLIPTIC INTEGRAL K (α)
(differences in units of 10^-5)

  α      K(α)      ΔK     Δ²K    Δ³K    Δ⁴K    Δ⁵K    Δ⁶K
 75°   2.76806
                  6461
 76°   2.83267            528
                  6989            84
 77°   2.90256            612            19
                  7601           103            13
 78°   2.97857            715            32            -5
                  8316           135             8
 79°   3.06173            850            40            18
                  9166           175            26
 80°   3.15339           1025            66            -1
                 10191           241            25
 81°   3.25530           1266            91            43
                 11457           332            68
 82°   3.36987           1598           159
                 13055           491
 83°   3.50042           2089
                 15144
 84°   3.65186


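Now apply Stirling's formula by way of a comparison:
$$\begin{aligned}
K(78^\circ 30') = {} & 2.97857 + 0.5\cdot\frac{7601 + 8316}{2}\cdot 10^{-5} + 0.125\cdot 715\cdot 10^{-5} - 0.0625\cdot\frac{103 + 135}{2}\cdot 10^{-5} \\
& - 0.0078\cdot 32\cdot 10^{-5} + 0.0117\cdot\frac{13 + 8}{2}\cdot 10^{-5} \\
= {} & 2.97857 + 0.039792 + 0.000894 - 0.000074 - 0.000002 + 0.000001 = 3.019181
\end{aligned}$$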
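
For comparison, the Stirling computation can be written as a brief Python sketch. The function name `stirling` and the argument `center` (the table index of $y_0$) are illustrative assumptions; the data are the ten tabulated values of $K(\alpha)$ from Table 46.

```python
from math import factorial

def stirling(x0, h, y, center, x, order):
    """Stirling's formula truncated after differences of the given order."""
    d = [list(y)]
    for _ in range(order):
        prev = d[-1]
        d.append([b - a for a, b in zip(prev, prev[1:])])
    q = (x - x0) / h
    total = y[center]
    odd_coeff = q      # running product q (q^2 - 1^2)(q^2 - 2^2)...
    for k in range(1, order + 1):
        m = (k - 1) // 2
        j = center - m - 1
        if k % 2 == 1:
            if m > 0:
                odd_coeff *= q * q - m * m
            # Odd orders use the mean of two adjacent differences.
            total += odd_coeff * 0.5 * (d[k][j] + d[k][j + 1]) / factorial(k)
        else:
            # Even orders use a single difference with one more factor q.
            total += q * odd_coeff * d[k][j] / factorial(k)
    return total

K = [2.76806, 2.83267, 2.90256, 2.97857, 3.06173,
     3.15339, 3.25530, 3.36987, 3.50042, 3.65186]   # alpha = 75°, ..., 84°
k_stirling = stirling(78.0, 1.0, K, center=3, x=78.5, order=5)
print(round(k_stirling, 6))
```

The sketch uses the differences up to the fifth order, exactly as in the hand computation, so the two results differ only in the rounding of the intermediate terms.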


14.12 LAGRANGE’S INTERPOLATION FORMULA 


The interpolation formulas derived in the preceding sections are
only suitable for equally spaced points. A more general formula,
called Lagrange's interpolation formula, is used for arbitrarily
specified points.

Let there be given, on an interval $[a, b]$, $n + 1$ distinct values
of the argument: $x_0, x_1, x_2, \ldots, x_n$, and let the corresponding
values of the function $y = f(x)$ be known:
$$f(x_0) = y_0, \quad f(x_1) = y_1, \quad \ldots, \quad f(x_n) = y_n$$
It is required to construct a polynomial $L_n(x)$ of degree not
exceeding $n$ having at the specified points $x_0, x_1, \ldots, x_n$ the same
values as the function $f(x)$, that is, such that
$$L_n(x_i) = y_i \quad (i = 0, 1, 2, \ldots, n)$$
(Fig. 62a).

Fig. 62a





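Let us first solve a particular problem: we construct a polynomial
$p_i(x)$ such that
$$p_i(x_j) = 0 \ \text{ for } j \ne i \quad \text{and} \quad p_i(x_i) = 1$$
(Fig. 62b).

Fig. 62b

These conditions can be written compactly thus:
$$p_i(x_j) = \delta_{ij} = \begin{cases} 1 & \text{if } j = i, \\ 0 & \text{if } j \ne i, \end{cases} \qquad (1)$$
where $\delta_{ij}$ is the Kronecker delta.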


Since the desired polynomial vanishes at the $n$ points $x_0, x_1, \ldots,
x_{i-1}, x_{i+1}, \ldots, x_n$, it has the form
$$p_i(x) = C_i(x - x_0)(x - x_1)\ldots(x - x_{i-1})(x - x_{i+1})\ldots(x - x_n) \qquad (2)$$
where $C_i$ is a constant coefficient. Setting $x = x_i$ in formula (2)
and noting that $p_i(x_i) = 1$, we get
$$C_i(x_i - x_0)(x_i - x_1)\ldots(x_i - x_{i-1})(x_i - x_{i+1})\ldots(x_i - x_n) = 1$$
whence
$$C_i = \frac{1}{(x_i - x_0)(x_i - x_1)\ldots(x_i - x_{i-1})(x_i - x_{i+1})\ldots(x_i - x_n)}$$
Putting this value in formula (2), we have
$$p_i(x) = \frac{(x - x_0)(x - x_1)\ldots(x - x_{i-1})(x - x_{i+1})\ldots(x - x_n)}{(x_i - x_0)(x_i - x_1)\ldots(x_i - x_{i-1})(x_i - x_{i+1})\ldots(x_i - x_n)} \qquad (3)$$
Let us now take up the solution of the general problem: to find
a polynomial $L_n(x)$ that satisfies the above-indicated conditions
$$L_n(x_i) = y_i$$
This polynomial is of the form
$$L_n(x) = \sum_{i=0}^{n} p_i(x)\,y_i \qquad (4)$$
Indeed, firstly, it is clear that the degree of the polynomial
$L_n(x)$ thus constructed is not higher than $n$ and, secondly, by
virtue of Condition (1) we have
$$L_n(x_j) = \sum_{i=0}^{n} p_i(x_j)\,y_i = p_j(x_j)\,y_j = y_j \quad (j = 0, 1, \ldots, n)$$
Substituting the value $p_i(x)$ from (3) into (4), we obtain
$$L_n(x) = \sum_{i=0}^{n} y_i\,\frac{(x - x_0)(x - x_1)\ldots(x - x_{i-1})(x - x_{i+1})\ldots(x - x_n)}{(x_i - x_0)(x_i - x_1)\ldots(x_i - x_{i-1})(x_i - x_{i+1})\ldots(x_i - x_n)} \qquad (5)$$
This is Lagrange's interpolation formula.
We will prove the uniqueness of the Lagrange polynomial.
Assume the contrary.

Let $\tilde{L}_n(x)$ be a polynomial distinct from $L_n(x)$, of degree not
exceeding $n$ and such that
$$\tilde{L}_n(x_i) = y_i \quad (i = 0, 1, \ldots, n)$$
Then the polynomial
$$Q_n(x) = \tilde{L}_n(x) - L_n(x)$$
whose degree clearly does not exceed $n$, vanishes at the $n + 1$ points
$x_0, x_1, x_2, \ldots, x_n$; thus
$$Q_n(x) \equiv 0$$
Hence
$$\tilde{L}_n(x) \equiv L_n(x)$$
From this it follows, for one thing, that if the points are equally
spaced, then the Lagrange interpolation polynomial coincides with
the corresponding Newton interpolation polynomial.

It will be noted that, in general, all the above-constructed 
interpolation formulas are obtained from the Lagrange interpola- 
tion formula for an appropriate choice of points. 

The Lagrange formula (5) may be written compactly if we introduce
the following designation:
$$\Pi_{n+1}(x) = (x - x_0)(x - x_1)\ldots(x - x_n) \qquad (6)$$
Differentiating this product with respect to $x$, we get
$$\Pi'_{n+1}(x) = \sum_{j=0}^{n}(x - x_0)(x - x_1)\ldots(x - x_{j-1})(x - x_{j+1})\ldots(x - x_n)$$
Putting $x = x_i$ $(i = 0, 1, 2, \ldots, n)$, we have
$$\Pi'_{n+1}(x_i) = (x_i - x_0)(x_i - x_1)\ldots(x_i - x_{i-1})(x_i - x_{i+1})\ldots(x_i - x_n) \qquad (7)$$
Putting (6) and (7) into formula (5), we obtain
$$L_n(x) = \Pi_{n+1}(x)\sum_{i=0}^{n}\frac{y_i}{\Pi'_{n+1}(x_i)(x - x_i)} \qquad (5')$$
It is worth noting that the Lagrange formula (unlike the earlier
interpolation formulas) contains $y_i$ explicitly, which is sometimes
very important.

Let us consider two special cases of Lagrange’s interpolation 
polynomial. 

For $n = 1$ we have two points and the Lagrange formula is
then the equation of a straight line $y = L_1(x)$ passing through the
two given points:
$$y = \frac{x - b}{a - b}\,y_0 + \frac{x - a}{b - a}\,y_1$$
where $a$, $b$ are the abscissas of these points.
For $n = 2$ we get the equation of the parabola $y = L_2(x)$ passing
through three points:
$$y = \frac{(x - b)(x - c)}{(a - b)(a - c)}\,y_0 + \frac{(x - a)(x - c)}{(b - a)(b - c)}\,y_1 + \frac{(x - a)(x - b)}{(c - a)(c - b)}\,y_2$$
where $a$, $b$, $c$ are the abscissas of the given points.


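Example 1. Construct the Lagrange interpolation polynomial for
the function $y = \sin \pi x$, choosing the points
$$x_0 = 0, \quad x_1 = \frac16, \quad x_2 = \frac12$$
Solution. Compute the corresponding values of the function:
$$y_0 = 0, \quad y_1 = \sin\frac{\pi}{6} = \frac12, \quad y_2 = \sin\frac{\pi}{2} = 1$$
Applying formula (5), we get
$$L_2(x) = \frac{\left(x - \frac16\right)\left(x - \frac12\right)}{\left(0 - \frac16\right)\left(0 - \frac12\right)}\cdot 0 + \frac{x\left(x - \frac12\right)}{\frac16\left(\frac16 - \frac12\right)}\cdot\frac12 + \frac{x\left(x - \frac16\right)}{\frac12\left(\frac12 - \frac16\right)}\cdot 1$$
or
$$L_2(x) = \frac72\,x - 3x^2$$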


Example 2. Given a table of the values of the function $y = f(x)$ [3]:

   x        y
 321.0   2.50651
 322.8   2.50893
 324.2   2.51081
 325.0   2.51188

Compute the value of $f(323.5)$.
Solution. Put $x = 323.5$, $n = 3$. Then by formula (5) we have
$$\begin{aligned}
f(323.5) = {} & \frac{(323.5 - 322.8)(323.5 - 324.2)(323.5 - 325.0)}{(321 - 322.8)(321 - 324.2)(321 - 325)}\cdot 2.50651 \\
& + \frac{(323.5 - 321)(323.5 - 324.2)(323.5 - 325)}{(322.8 - 321)(322.8 - 324.2)(322.8 - 325)}\cdot 2.50893 \\
& + \frac{(323.5 - 321)(323.5 - 322.8)(323.5 - 325)}{(324.2 - 321)(324.2 - 322.8)(324.2 - 325)}\cdot 2.51081 \\
& + \frac{(323.5 - 321)(323.5 - 322.8)(323.5 - 324.2)}{(325 - 321)(325 - 322.8)(325 - 324.2)}\cdot 2.51188 \\
= {} & -0.07996 + 1.18794 + 1.83897 - 0.43708 = 2.50987
\end{aligned}$$
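
Formula (5) translates directly into a double loop. The following Python sketch (the function name `lagrange` is an illustrative choice, not from the book) evaluates the Lagrange polynomial for arbitrarily spaced points and repeats the computation of Example 2.

```python
def lagrange(xs, ys, x):
    """Evaluate the Lagrange interpolation polynomial, formula (5), at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                # Factor (x - x_j)/(x_i - x_j) of the i-th basic polynomial.
                term *= (x - xj) / (xi - xj)
        total += term
    return total

xs = [321.0, 322.8, 324.2, 325.0]
ys = [2.50651, 2.50893, 2.51081, 2.51188]
value = lagrange(xs, ys, 323.5)
print(round(value, 5))
```

At each interpolation point the polynomial reproduces the tabulated value, which is a convenient check on any implementation.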





14.13 COMPUTING LAGRANGIAN COEFFICIENTS


We give here a scheme for simplified computation of the coefficients
of $y_i$ $(i = 0, 1, 2, \ldots, n)$ in the Lagrange formula, the
so-called Lagrangian coefficients
$$L_i^{(n)}(x) = \frac{(x - x_0)(x - x_1)\ldots(x - x_{i-1})(x - x_{i+1})\ldots(x - x_n)}{(x_i - x_0)(x_i - x_1)\ldots(x_i - x_{i-1})(x_i - x_{i+1})\ldots(x_i - x_n)} \qquad (1)$$
or, in a more compact notation,
$$L_i^{(n)}(x) = \frac{\Pi_{n+1}(x)}{(x - x_i)\,\Pi'_{n+1}(x_i)} \qquad (2)$$
where
$$\Pi_{n+1}(x) = (x - x_0)\ldots(x - x_n)$$
The Lagrange formula then has the form
$$L_n(x) = \sum_{i=0}^{n} L_i^{(n)}(x)\,y_i$$
It is to be noted that the form of the Lagrangian coefficients is
invariant under an integral linear substitution $x = at + b$ ($a$, $b$
constants and $a$ not zero). Indeed, putting
$$x = at + b, \quad x_j = at_j + b \quad (j = 0, 1, 2, \ldots, n)$$
in formula (1) we get, after cancelling $a^n$ from numerator and
denominator,
$$L_i^{(n)}(x) = \frac{(t - t_0)(t - t_1)\ldots(t - t_{i-1})(t - t_{i+1})\ldots(t - t_n)}{(t_i - t_0)(t_i - t_1)\ldots(t_i - t_{i-1})(t_i - t_{i+1})\ldots(t_i - t_n)} \qquad (3)$$
or
$$L_i^{(n)}(x) = \frac{\Pi_{n+1}(t)}{(t - t_i)\,\Pi'_{n+1}(t_i)} \qquad (3')$$
where
$$\Pi_{n+1}(t) = (t - t_0)(t - t_1)\ldots(t - t_n)$$
which completes the proof.

Lagrangian coefficients are conveniently computed (especially
by computing machines) by the scheme given below. First enter
the differences as follows:
$$\begin{array}{ccccc}
\underline{x - x_0} & x_0 - x_1 & x_0 - x_2 & \ldots & x_0 - x_n \\
x_1 - x_0 & \underline{x - x_1} & x_1 - x_2 & \ldots & x_1 - x_n \\
x_2 - x_0 & x_2 - x_1 & \underline{x - x_2} & \ldots & x_2 - x_n \\
\ldots & \ldots & \ldots & \ldots & \ldots \\
x_n - x_0 & x_n - x_1 & x_n - x_2 & \ldots & \underline{x - x_n}
\end{array} \qquad (*)$$
Denote the product of the elements of the first row by $D_0$, of the
second row by $D_1$, and so on. Then the product of the elements
of the principal diagonal (these elements are underlined in the
accompanying scheme) will obviously be $\Pi_{n+1}(x)$, whence it follows
that
$$L_i^{(n)}(x) = \frac{\Pi_{n+1}(x)}{D_i} \quad (i = 0, 1, \ldots, n) \qquad (4)$$
Hence
$$L_n(x) = \Pi_{n+1}(x)\sum_{i=0}^{n}\frac{y_i}{D_i} \qquad (5)$$

The Lagrangian coefficients can be reduced to a simpler form
in the case of equally spaced points.
Setting
$$x = x_0 + th$$
we have
$$t_0 = 0, \quad t_1 = 1, \quad \ldots, \quad t_n = n$$
whence
$$\Pi_{n+1}(t) = t(t - 1)(t - 2)\ldots(t - n)$$
and
$$\Pi'_{n+1}(t_i) = (-1)^{n-i}\,i!\,(n - i)!$$
Substituting these expressions in formula (3'), we get
$$L_i^{(n)}(x) = (-1)^{n-i}\,\frac{C_n^i}{t - i}\cdot\frac{t(t - 1)\ldots(t - n)}{n!} \qquad (6)$$
where
$$C_n^i = \frac{n!}{i!\,(n - i)!}$$
whence
$$L_n(x) = \frac{\Pi_{n+1}(t)}{n!}\sum_{i=0}^{n}(-1)^{n-i}\,\frac{C_n^i}{t - i}\,y_i \qquad (7)$$
where
$$t = \frac{x - x_0}{h}$$
In the case of a constant interval $h$, the problem of interpolation
is further simplified by the fact that tables of Lagrangian
coefficients are available (see [5]) so that actually all the
computations reduce to multiplying the tabulated coefficients by the
appropriate values of the function $y_i$ followed by a summation.


Example 1. For the function $y = y(x)$ we have the table

 x   0.05    0.15    0.20    0.25    0.35    0.40    0.50    0.55
 y   0.9512  0.8607  0.8187  0.7788  0.7047  0.6703  0.6065  0.5769
 t   1       3       4       5       7       8       10      11

Find $y(0.45)$.

Solution. To simplify computations we set
$$x = 0.05\,t$$
Then the values of the new variable $t$ corresponding to the
interpolation points will be 1, 3, 4, 5, 7, 8, 10, 11. We have to find
the value of $y$ for $x = 0.45$; that is, for $t = 9$. Denoting the tabulated
values of $t$ by $t_i$ $(i = 0, 1, \ldots, 7)$, we arrange the computations
as given in Table 47.

TABLE 47
COMPUTATIONAL SCHEME FOR LAGRANGIAN COEFFICIENTS
(diagonal entries t - t_i in brackets, entries t_i - t_j elsewhere)

 i |                differences                 |   D_i    |  y_i   |   y_i/D_i
 0 | [8]  -2   -3   -4   -6   -7   -9   -10     | -725760  | 0.9512 | -0.0131·10^-4
 1 |  2   [6]  -1   -2   -4   -5   -7   -8      |   26880  | 0.8607 |  0.3202·10^-4
 2 |  3    1   [5]  -1   -3   -4   -6   -7      |   -7560  | 0.8187 | -1.0829·10^-4
 3 |  4    2    1   [4]  -2   -3   -5   -6      |    5760  | 0.7788 |  1.3521·10^-4
 4 |  6    4    3    2   [2]  -1   -3   -4      |   -3456  | 0.7047 | -2.0391·10^-4
 5 |  7    5    4    3    1   [1]  -2   -3      |    2520  | 0.6703 |  2.6599·10^-4
 6 |  9    7    6    5    3    2   [-1] -1      |   11340  | 0.6065 |  0.5348·10^-4
 7 | 10    8    7    6    4    3    1   [-2]    |  -80640  | 0.5769 | -0.0715·10^-4

 Π(9) = 3840                                      S = 1.6604·10^-4

whence
$$y(0.45) = \Pi(9)\sum_{i=0}^{7}\frac{y_i}{D_i} = \Pi(9)\cdot S = 3840\cdot 1.6604\cdot 10^{-4} = 0.6376$$
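
The scheme lends itself to direct programming. The sketch below is a minimal Python illustration of formulas (4) and (5); the function name `lagrange_by_scheme` and its return convention are assumptions made for the example. For the data of this example it gives $\Pi(9) = 3840$, a sum of about $1.6604\cdot 10^{-4}$ and an interpolated value of about 0.6376.

```python
def lagrange_by_scheme(ts, ys, t):
    """Return (Pi, S, value) computed by the row-product scheme (*)."""
    pi = 1.0
    s = 0.0
    for i, ti in enumerate(ts):
        pi *= t - ti                 # product of the diagonal elements
        d_i = t - ti                 # diagonal element of row i
        for j, tj in enumerate(ts):
            if j != i:
                d_i *= ti - tj       # off-diagonal elements of row i
        s += ys[i] / d_i
    return pi, s, pi * s

ts = [1, 3, 4, 5, 7, 8, 10, 11]
ys = [0.9512, 0.8607, 0.8187, 0.7788, 0.7047, 0.6703, 0.6065, 0.5769]
pi, s, y_045 = lagrange_by_scheme(ts, ys, 9)
print(pi, round(s * 1e4, 4), round(y_045, 4))
```

Each row product $D_i$ is formed only once, so the cost is of the order of $n^2$ multiplications for one value of the argument.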





Example 2. The function $y = \cos x$ is given by the table [5]

 x   5.0          5.1          5.2          5.3
 y   0.283662185  0.377977743  0.468516671  0.554374336
 t   0            1            2            3

 x   5.4          5.5          5.6          5.7
 y   0.634692876  0.708669774  0.775565879  0.834712785
 t   4            5            6            7

Find $\cos 5.347$.


Solution. Make a change of variable by the formula
$$x = 0.1\,t + 5$$
Then the values of the variable $t$ corresponding to the interpolation
points will be 0, 1, 2, 3, 4, 5, 6, 7 and the desired value
$x = 5.347$ will become $t = 3.47$. Noting that the points $t_i = i$
$(i = 0, 1, \ldots, 7)$ are equally spaced, the computations can be
carried out by the indicated scheme (see Table 48).

TABLE 48
COMPUTATIONAL SCHEME OF LAGRANGIAN COEFFICIENTS
FOR EQUALLY SPACED POINTS

 i   x_i     y_i          t - i   (-1)^(n-i) C_n^i   (-1)^(n-i) C_n^i y_i/(t - i)
 0   5.0   0.283662185    3.47         -1                 -0.08174702
 1   5.1   0.377977743    2.47          7                  1.07119198
 2   5.2   0.468516671    1.47        -21                 -6.69309530
 3   5.3   0.554374336    0.47         35                 41.28319523
 4   5.4   0.634692876   -0.53        -35                 41.91368048
 5   5.5   0.708669774   -1.53         21                 -9.72684003
 6   5.6   0.775565879   -2.53         -7                  2.14583444
 7   5.7   0.834712785   -3.53          1                 -0.23646254

 Π = 42.8848749                                        S = 69.67575724


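From Table 48 we have
$$\Pi(3.47) = \prod_{i=0}^{7}(3.47 - i) = 42.8848749$$
and
$$S = \sum_{i=0}^{7}(-1)^{7-i}\,\frac{C_7^i\,y_i}{3.47 - i} = 69.67575724$$
On the basis of formula (7) we get
$$\cos 5.347 = \frac{1}{7!}\,\Pi(3.47)\cdot S = 0.592864312$$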
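
Formula (7) for equally spaced points can be sketched in a few lines of Python. The function name `lagrange_equal` is an illustrative assumption; `math.comb` supplies the binomial coefficients $C_n^i$, and the argument `t` must not coincide with one of the integers $0, 1, \ldots, n$.

```python
from math import comb, cos, factorial

def lagrange_equal(y, t):
    """Return (Pi, S, value) for equally spaced points t_i = i, formula (7)."""
    n = len(y) - 1
    pi = 1.0
    for i in range(n + 1):
        pi *= t - i
    s = sum((-1) ** (n - i) * comb(n, i) * y[i] / (t - i) for i in range(n + 1))
    return pi, s, pi * s / factorial(n)

cosines = [0.283662185, 0.377977743, 0.468516671, 0.554374336,
           0.634692876, 0.708669774, 0.775565879, 0.834712785]
pi_t, s, cos_5347 = lagrange_equal(cosines, 3.47)
print(round(pi_t, 7), round(s, 8), round(cos_5347, 9))
```

The interpolated value can be compared with `cos(5.347)` from the standard library; with nine-place data the two agree closely.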


14.14 ERROR ESTIMATE OF LAGRANGE’S INTERPOLATION 
FORMULA 
In Sec. 14.12, for the function $y = f(x)$ we constructed the Lagrange
interpolation polynomial $L_n(x)$, which assumes at the points
$x_0, x_1, \ldots, x_n$ the given values
$$y_0 = f(x_0), \quad y_1 = f(x_1), \quad \ldots, \quad y_n = f(x_n)$$


The question arises as to how closely the constructed polynomial
approaches the function $f(x)$ at other points, that is, as to how
great is the remainder term
$$R_n(x) = f(x) - L_n(x)$$
To determine this degree of approximation we impose additional
restrictions on the function $y = f(x)$; namely, we will assume
that in the range under consideration, $a \le x \le b$, which contains
the points of interpolation, the function $f(x)$ has all derivatives
$f'(x), f''(x), \ldots, f^{(n+1)}(x)$ up to the $(n+1)$th order inclusive.
We introduce an auxiliary function
$$u(x) = f(x) - L_n(x) - k\,\Pi_{n+1}(x) \qquad (1)$$
where
$$\Pi_{n+1}(x) = (x - x_0)(x - x_1)\ldots(x - x_n)$$
and $k$ is a constant coefficient which will be chosen below.
The function $u(x)$ obviously has $n + 1$ roots at the points
$$x_0,\ x_1,\ \ldots,\ x_n$$
Now choose the coefficient $k$ so that $u(x)$ has an $(n+2)$th root at
any (fixed) point $\bar{x}$ of the interval $[a, b]$ which does not
coincide with the points of interpolation (Fig. 63). For this it
suffices to put
$$f(\bar{x}) - L_n(\bar{x}) - k\,\Pi_{n+1}(\bar{x}) = 0$$
whence, since $\Pi_{n+1}(\bar{x}) \ne 0$,
$$k = \frac{f(\bar{x}) - L_n(\bar{x})}{\Pi_{n+1}(\bar{x})} \qquad (2)$$
For this value of the factor $k$, the function $u(x)$ has $n + 2$ roots in
the interval $[a, b]$ and will vanish at the endpoints of each of
the intervals
$$[x_0, x_1],\ [x_1, x_2],\ \ldots,\ [x_i, \bar{x}],\ [\bar{x}, x_{i+1}],\ \ldots,\ [x_{n-1}, x_n]$$


Applying Rolle's theorem to each of these intervals, we see that
the derivative $u'(x)$ has at least $n + 1$ roots in the interval $[a, b]$.
Applying Rolle's theorem to the derivative $u'(x)$, we see that the
second derivative $u''(x)$ vanishes no less than $n$ times on the
interval $[a, b]$.

Continuing this reasoning, we conclude that on the interval
$[a, b]$ under consideration, the derivative $u^{(n+1)}(x)$ has at least
one zero, which we denote by $\xi$; thus, $u^{(n+1)}(\xi) = 0$.

From formula (1), since
$$L_n^{(n+1)}(x) = 0 \quad \text{and} \quad \Pi_{n+1}^{(n+1)}(x) = (n+1)!,$$
we have
$$u^{(n+1)}(x) = f^{(n+1)}(x) - k\,(n+1)!$$
For $x = \xi$ we obtain
$$0 = f^{(n+1)}(\xi) - k\,(n+1)!$$
whence
$$k = \frac{f^{(n+1)}(\xi)}{(n+1)!} \qquad (3)$$
Comparing the right members of (2) and (3), we have
$$\frac{f(\bar{x}) - L_n(\bar{x})}{\Pi_{n+1}(\bar{x})} = \frac{f^{(n+1)}(\xi)}{(n+1)!}$$
Thus
$$f(\bar{x}) - L_n(\bar{x}) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\,\Pi_{n+1}(\bar{x}) \qquad (4)$$
Since $\bar{x}$ is arbitrary, formula (4) may be written thus:
$$R_n(x) = f(x) - L_n(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}\,\Pi_{n+1}(x) \qquad (5)$$
where $\xi$ depends on $x$ and lies inside the interval $[a, b]$.
Note that (5) is valid for all points of $[a, b]$ including the
points of interpolation.
Denoting
$$M_{n+1} = \max_{a \le x \le b}|f^{(n+1)}(x)|$$
we obtain the following estimate for the absolute error in the
Lagrange interpolation formula:
$$|R_n(x)| = |f(x) - L_n(x)| \le \frac{M_{n+1}}{(n+1)!}\,|\Pi_{n+1}(x)| \qquad (6)$$
where
$$\Pi_{n+1}(x) = (x - x_0)(x - x_1)\ldots(x - x_n) \qquad (6')$$


Example. To what degree of accuracy can we calculate $\sqrt{115}$ by
means of Lagrange's interpolation formula for the function $y = \sqrt{x}$
if we choose the interpolation points $x_0 = 100$, $x_1 = 121$, $x_2 = 144$?

Solution. We have
$$y' = \frac12\,x^{-1/2}, \quad y'' = -\frac14\,x^{-3/2}, \quad y''' = \frac38\,x^{-5/2}$$
whence
$$M_3 = \max|y'''| = \frac38\cdot\frac{1}{\sqrt{100^5}} = \frac38\cdot 10^{-5} \quad \text{for } 100 \le x \le 144$$
On the basis of formula (6) we obtain
$$|R_2| \le \frac38\cdot 10^{-5}\cdot\frac{1}{3!}\,|(115 - 100)(115 - 121)(115 - 144)| = \frac{1}{16}\cdot 10^{-5}\cdot 15\cdot 6\cdot 29 \approx 1.6\cdot 10^{-3}$$
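
The estimate can be compared with the error actually committed. The following Python sketch (variable and function names are illustrative) evaluates the bound (6) for this example and the true deviation of the quadratic Lagrange polynomial from $\sqrt{115}$.

```python
from math import factorial, sqrt

def lagrange(xs, ys, x):
    """Evaluate the Lagrange interpolation polynomial at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

nodes = [100.0, 121.0, 144.0]
values = [10.0, 11.0, 12.0]          # exact square roots at the nodes
x = 115.0

m3 = 3.0 / 8.0 * 100.0 ** -2.5       # max |y'''| on [100, 144], attained at x = 100
prod = 1.0
for xi in nodes:
    prod *= x - xi
bound = m3 / factorial(3) * abs(prod)
actual = abs(sqrt(x) - lagrange(nodes, values, x))
print(bound, actual)
```

The actual error is of the same order as the bound but smaller, as it must be, since the third derivative is largest at the left end of the interval.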


14.15 ERROR ESTIMATES OF NEWTON’S 
INTERPOLATION FORMULAS 


If the points $x_0, x_1, \ldots, x_n$ are equally spaced, and
$$x_{i+1} - x_i = h \quad (i = 0, 1, 2, \ldots, n-1)$$
then, putting
$$q = \frac{x - x_0}{h}$$
we obtain, on the basis of (5) of Sec. 14.14, the remainder term
of Newton's first interpolation formula:
$$R_n(x) = h^{n+1}\,\frac{q(q - 1)\ldots(q - n)}{(n+1)!}\,f^{(n+1)}(\xi) \qquad (1)$$
where $\xi$ is a value intermediate between the abscissas $x_0, x_1, \ldots, x_n$
and the point $x$ at hand. Note that for the case of interpolation
in the narrow sense $\xi \in [x_0, x_n]$; for extrapolation it is possible
that $\xi \notin [x_0, x_n]$.

Similarly, putting
$$q = \frac{x - x_n}{h}$$
in formula (5) of Sec. 14.14, we get the remainder term of Newton's
second interpolation formula:
$$R_n(x) = h^{n+1}\,\frac{q(q + 1)\ldots(q + n)}{(n+1)!}\,f^{(n+1)}(\xi) \qquad (2)$$
where $\xi$ is some value intermediate between the abscissas
$x_0, x_1, \ldots, x_n$ and the point $x$.


Ordinarily, in practical computations, Newton's interpolation
formula terminates with terms containing differences which, to
within the limits of the specified accuracy, may be considered constant.

Assuming that $\Delta^{n+1} y$ are nearly constant for the function
$y = f(x)$ and $h$ is sufficiently small, and taking into account that
$$f^{(n+1)}(x) = \lim_{h \to 0}\frac{\Delta^{n+1} y}{h^{n+1}}$$
we can approximately set
$$f^{(n+1)}(\xi) \approx \frac{\Delta^{n+1} y_0}{h^{n+1}}$$
In this case the remainder term of Newton's first interpolation
formula is equal to
$$R_n(x) \approx \frac{q(q - 1)\ldots(q - n)}{(n+1)!}\,\Delta^{n+1} y_0$$
Under these conditions, we get for the remainder term of Newton's
second interpolation formula the expression
$$R_n(x) \approx \frac{q(q + 1)\ldots(q + n)}{(n+1)!}\,\Delta^{n+1} y_n$$


Example 1. Five-place log tables give the logarithms of integers
from $x = 1000$ to $x = 10{,}000$ to within a limiting error of $\frac12\cdot 10^{-5}$.
Is linear interpolation possible to the same degree of accuracy?
Solution. Setting
$$y = \log_{10} x$$
we have
$$y' = \frac{M}{x} \quad \text{and} \quad y'' = -\frac{M}{x^2}$$
where $M = 0.43$, whence
$$M_2 = \max|y''| \le \frac{M}{1000^2} < 5\cdot 10^{-7}$$
From formula (1) we get, for $n = 1$ and $h = 1$, an estimate of the
error of linear interpolation:
$$|R_1(x)| \le \frac{|q(q - 1)|}{2!}\,M_2 \le \frac{q(1 - q)}{2}\cdot 5\cdot 10^{-7}$$
Since for $0 \le q \le 1$ we have
$$q - q^2 = \frac14 - \left(\frac12 - q\right)^2 \le \frac14$$
we finally get
$$|R_1(x)| \le \frac18\cdot 5\cdot 10^{-7} < 10^{-7}$$
Consequently, linear interpolation is quite admissible.

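Example 2. Estimate the error resulting from approximating the
function $f(x) = \sin x$ by the fifth-degree interpolation polynomial
$P_5(x)$ which coincides with the given function for the values
$x = 0^\circ, 5^\circ, 10^\circ, 15^\circ, 20^\circ, 25^\circ$.

Solution. Here $f^{(6)}(x) = -\sin x$, and so $|f^{(6)}(x)| \le 1$. On the basis
of formula (1) we have
$$|\sin x - P_5(x)| \le \frac{1}{6!}\left|x\left(x - \frac{\pi}{36}\right)\left(x - \frac{\pi}{18}\right)\left(x - \frac{\pi}{12}\right)\left(x - \frac{\pi}{9}\right)\left(x - \frac{5\pi}{36}\right)\right|$$
For example, for $x = 12^\circ 30' = \operatorname{arc} 0.21816$ we obtain
$$|\sin x - P_5(x)| < 2.2\cdot 10^{-9}$$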
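
A quick numerical check of this estimate is given by the following Python sketch (the names are illustrative). It evaluates the product of the six factors at $x = 12^\circ 30'$, divides by $6!$, and compares the bound with the deviation of the interpolation polynomial built from exact sines.

```python
from math import factorial, radians, sin

nodes = [radians(d) for d in (0, 5, 10, 15, 20, 25)]
x = radians(12.5)

# Bound from formula (1) with |f^(6)| <= 1: |Pi_6(x)| / 6!
prod = 1.0
for xi in nodes:
    prod *= x - xi
bound = abs(prod) / factorial(6)

# Actual error of the fifth-degree Lagrange polynomial at the same point.
p5 = 0.0
for i, xi in enumerate(nodes):
    term = sin(xi)
    for j, xj in enumerate(nodes):
        if j != i:
            term *= (x - xj) / (xi - xj)
    p5 += term
actual = abs(sin(x) - p5)
print(bound, actual)
```

The actual error is smaller than the bound because $|\sin\xi|$ stays well below 1 on the interval from $0^\circ$ to $25^\circ$.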


14.16 ERROR ESTIMATES OF THE CENTRAL 
INTERPOLATION FORMULAS 


We give without proof the remainder terms for the Stirling and
Bessel formulas [3].

(a) The remainder term of Stirling's interpolation formula. If $2n$
is the order of the largest difference used in a table and
$x \in [x_0 - nh,\ x_0 + nh]$, then
$$R_n(x) = \frac{h^{2n+1} f^{(2n+1)}(\xi)}{(2n+1)!}\,q(q^2 - 1^2)(q^2 - 2^2)(q^2 - 3^2)\ldots(q^2 - n^2)$$
where
$$q = \frac{x - x_0}{h} \quad \text{and} \quad \xi \in [x_0 - nh,\ x_0 + nh]$$
But if the analytic expression of the function $f(x)$ is not known,
then for small $h$ we assume
$$R_n(x) \approx \frac{\Delta^{2n+1} y_{-(n+1)} + \Delta^{2n+1} y_{-n}}{2\,(2n+1)!}\,q(q^2 - 1^2)(q^2 - 2^2)\ldots(q^2 - n^2)$$


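(b) The remainder term of Bessel's interpolation formula. If $2n + 1$
is the order of the largest difference used in the table and
$x \in [x_0 - nh,\ x_0 + (n+1)h]$, then
$$R_n(x) = \frac{h^{2n+2}}{(2n+2)!}\,f^{(2n+2)}(\xi)\,q(q^2 - 1^2)(q^2 - 2^2)\ldots(q^2 - n^2)\,[q - (n+1)]$$
where
$$q = \frac{x - x_0}{h} \quad \text{and} \quad \xi \in [x_0 - nh,\ x_0 + (n+1)h]$$
However, if the function $f(x)$ is tabulated and the interval $h$ is
small, then we assume
$$R_n(x) \approx \frac{\Delta^{2n+2} y_{-(n+1)} + \Delta^{2n+2} y_{-n}}{2\,(2n+2)!}\,q(q^2 - 1^2)(q^2 - 2^2)\ldots(q^2 - n^2)\,[q - (n+1)]$$
In particular, for $q = \frac12$ we obtain the error for interpolating to
halves:
$$R_n = (-1)^{n+1}\,\frac{h^{2n+2} f^{(2n+2)}(\xi)}{(2n+2)!}\cdot\frac{[1\cdot 3\cdot 5\ldots(2n+1)]^2}{2^{2n+2}}$$
or
$$R_n \approx (-1)^{n+1}\,\frac{\Delta^{2n+2} y_{-(n+1)} + \Delta^{2n+2} y_{-n}}{2\,(2n+2)!}\cdot\frac{[1\cdot 3\cdot 5\ldots(2n+1)]^2}{2^{2n+2}}$$
If we put
$$q = p + \frac12$$
then the formula for the remainder term of Bessel's formula takes
the form
$$R_n(x) = \frac{h^{2n+2}}{(2n+2)!}\,f^{(2n+2)}(\xi)\left(p^2 - \frac14\right)\left(p^2 - \frac94\right)\ldots\left[p^2 - \frac{(2n+1)^2}{4}\right]$$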


14.17 ON THE BEST CHOICE OF INTERPOLATION POINTS 


In analyzing formula (5) of Sec. 14.14, we see that the error
$R_n(x)$ in Lagrange's formula is, to within a numerical constant, a
product of two factors, one of which, $f^{(n+1)}(\xi)$, depends on the
properties of the function $f(x)$ and is not amenable to regulation,
while the magnitude of the other, $\Pi_{n+1}(x)$, is determined
exclusively by the choice of the interpolation points (abscissas).

If the abscissas $x_i$ are not arranged suitably, the upper bound
of the modulus of the error $R_n(x)$ [see (6), Sec. 14.14] may be
very large. For example, if we bunch the abscissas $x_i$ near one
end of the interval $[a, b]$, then $R_n(x)$ will, generally speaking,
for $l = b - a > 1$, be great at the points $x$ close to the other end
of the interval. The question thus arises of an optimal choice of
the interpolation points $x_i$ (for a given number of points $n$) so
that the portion of the error that depends on us, the polynomial
$\Pi_{n+1}(x)$, has the smallest possible maximum modulus on the
interval $[a, b]$, or, to put it briefly, "deviates least of all from zero
on $[a, b]$". This problem was solved by the Russian mathematician
P. L. Chebyshev [2], [6], who proved that the best choice
of abscissas (interpolation points) in the indicated meaning is
given by the formula
$$x_i = \frac{b + a}{2} + \frac{b - a}{2}\,\xi_i$$
where
$$\xi_i = -\cos\frac{2i + 1}{2n + 2}\,\pi \quad (i = 0, 1, 2, \ldots, n)$$
are zeros of the so-called Chebyshev polynomial $T_{n+1}(x)$. We
then have
$$|\Pi_{n+1}(x)| \le 2\left(\frac{b - a}{4}\right)^{n+1}$$
It is interesting to note that these points are not equally spaced
but bunch up near the endpoints of the interval. Even for
such a choice of points, one cannot, in the general case, guarantee
that the absolute value of the error will be arbitrarily small
for sufficiently large n.
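
The effect of the choice of abscissas is easy to observe numerically. The Python sketch below is an illustration only: the interval $[0, 4]$, the value $n = 5$, the sampling density and the function names are assumptions made for the example. It compares the largest sampled value of $|\Pi_{n+1}(x)|$ for Chebyshev points and for equally spaced points with the bound $2\left(\frac{b - a}{4}\right)^{n+1}$.

```python
from math import cos, pi

def chebyshev_nodes(a, b, n):
    """Interpolation points x_i = (b+a)/2 + (b-a)/2 * xi_i, i = 0, ..., n."""
    return [0.5 * (b + a) - 0.5 * (b - a) * cos((2 * i + 1) * pi / (2 * n + 2))
            for i in range(n + 1)]

def max_abs_product(nodes, a, b, samples=2001):
    """Largest sampled value of |(x - x_0)(x - x_1)...(x - x_n)| on [a, b]."""
    best = 0.0
    for k in range(samples):
        x = a + (b - a) * k / (samples - 1)
        prod = 1.0
        for xi in nodes:
            prod *= x - xi
        best = max(best, abs(prod))
    return best

a, b, n = 0.0, 4.0, 5
cheb = chebyshev_nodes(a, b, n)
equal = [a + (b - a) * i / n for i in range(n + 1)]
bound = 2.0 * ((b - a) / 4.0) ** (n + 1)
max_cheb = max_abs_product(cheb, a, b)
max_equal = max_abs_product(equal, a, b)
print(bound, max_cheb, max_equal)
```

For equally spaced points the product is largest near the ends of the interval, which is exactly where the Chebyshev points are more densely placed.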

We will now remark generally on the determination of errors of
the interpolation formulas. If the maximal differences are practically
constant, then the result of interpolation in the narrow sense
is ordinarily correct to as many decimal places as given in the
tabulated data, and so it is not necessary to estimate the errors.
When using Lagrange's interpolation formula there is no possibility
of following the course of finite differences; if possible one
should therefore estimate the remainder term.

If the function $f(x)$ is tabulated and the analytic expression is
not known, then, strictly speaking, it is impossible to estimate
the error of the interpolation polynomial. True enough, since for
a given polynomial it is theoretically possible to construct an
infinity of distinct functions that coincide with the polynomial in
the given set of points, at intermediate points the deviation
of the interpolation polynomial from the function may be
arbitrarily large. However, if the nature of the function is such that
its graph is a smooth curve, then it is possible to determine
approximately the errors of the interpolation polynomials to a
high degree of confidence on the basis of the values of higher-order
differences by the above-indicated formulas.


14.18 DIVIDED DIFFERENCES 


When constructing difference tables, we have assumed that the
values of the argument of the function are equally spaced, that is,
that they have a constant interval. However, tables for unequally
spaced values of the argument (tables with variable interval) are
also used. For example, empirical data are of this nature. For
tables with unequally spaced values of the argument (variable
interval), the concept of finite differences is generalized to so-called
divided differences.

Suppose we have a tabulated function $y = f(x)$ and $x_0, x_1,
x_2, \ldots$ are the values of the argument and $y_0, y_1, y_2, \ldots$ are the
corresponding values of the function, where the differences
$$\Delta x_i = x_{i+1} - x_i \ne 0 \quad (i = 0, 1, 2, \ldots)$$
are unequal.
The ratios
$$[x_i, x_{i+1}] = \frac{y_{i+1} - y_i}{x_{i+1} - x_i}$$
$(i = 0, 1, 2, \ldots)$ are called divided differences of the first order.
For example,
$$[x_0, x_1] = \frac{y_1 - y_0}{x_1 - x_0}, \quad [x_1, x_2] = \frac{y_2 - y_1}{x_2 - x_1}, \quad \text{etc.}$$
Similarly, we define second-order divided differences
$$[x_i, x_{i+1}, x_{i+2}] = \frac{[x_{i+1}, x_{i+2}] - [x_i, x_{i+1}]}{x_{i+2} - x_i}$$
$(i = 0, 1, 2, \ldots)$. For example,
$$[x_0, x_1, x_2] = \frac{[x_1, x_2] - [x_0, x_1]}{x_2 - x_0}$$
and so forth.

Generally, nth order divided differences are obtained from (n—1)th 
order divided differences by means of the recurrence relation 


$$[x_i, x_{i+1}, \ldots, x_{i+n}] = \frac{[x_{i+1}, \ldots, x_{i+n}] - [x_i, \ldots, x_{i+n-1}]}{x_{i+n} - x_i} \tag{1}$$


Note that divided differences remain unchanged under a per- 
mutation of the elements; that is, they are symmetric functions 
of their arguments. For instance, 


$$[x_0, x_1] = \frac{y_1 - y_0}{x_1 - x_0} = \frac{y_0 - y_1}{x_0 - x_1} = [x_1, x_0], \quad \text{etc.}$$


Divided differences are ordinarily arranged in tables of the type 
shown below (Table 49). 


TABLE 49
TABLE OF DIVIDED DIFFERENCES

                          Divided differences
 x     y     1st          2nd               3rd                    4th

 x_0   y_0
             [x_0, x_1]
 x_1   y_1                [x_0, x_1, x_2]
             [x_1, x_2]                     [x_0, x_1, x_2, x_3]
 x_2   y_2                [x_1, x_2, x_3]                          [x_0, x_1, x_2, x_3, x_4]
             [x_2, x_3]                     [x_1, x_2, x_3, x_4]
 x_3   y_3                [x_2, x_3, x_4]
             [x_3, x_4]
 x_4   y_4





Example. Form the divided differences for a function specified by
the following table:

 x |    0    |   0.2   |   0.3   |   0.4   |   0.7   |   0.9
 y | 132.651 | 148.877 | 157.464 | 166.375 | 195.112 | 216.000





Solution. Successively applying formula (1), we have

$$[x_0, x_1] = \frac{148.877 - 132.651}{0.2 - 0} = 81.13,$$
$$[x_1, x_2] = \frac{157.464 - 148.877}{0.3 - 0.2} = 85.87,$$
$$[x_0, x_1, x_2] = \frac{85.87 - 81.13}{0.3 - 0} = 15.8$$

and so on. The results are tabulated in Table 50.
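The recurrence (1) of Sec. 14.18 lends itself to a short program. The following is a minimal sketch in Python, not part of the original text; the function name `divided_differences` is an illustrative choice. It builds the table column by column for the data of the example.

```python
def divided_differences(xs, ys):
    """Return the columns of the divided-difference table.

    table[0] holds the function values; table[k] holds the k-th order
    divided differences [x_i, ..., x_{i+k}] for i = 0, 1, ...
    """
    table = [list(ys)]
    for order in range(1, len(xs)):
        prev = table[-1]
        # Recurrence (1): difference of neighbouring entries of the previous
        # column, divided by the span of the arguments involved.
        col = [
            (prev[i + 1] - prev[i]) / (xs[i + order] - xs[i])
            for i in range(len(prev) - 1)
        ]
        table.append(col)
    return table


xs = [0.0, 0.2, 0.3, 0.4, 0.7, 0.9]
ys = [132.651, 148.877, 157.464, 166.375, 195.112, 216.000]
table = divided_differences(xs, ys)
for order, col in enumerate(table[1:], start=1):
    print(order, [round(v, 2) for v in col])
```

For these data the third-order differences are all equal and the higher ones vanish (up to rounding), which is what one expects of values taken from a cubic polynomial.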


14.19 NEWTON'S INTERPOLATION FORMULA FOR UNEQUALLY 
SPACED VALUES OF THE ARGUMENT 

Using the concept of divided differences, we can represent Lagrange's
interpolation formula in a form similar to Newton's first
interpolation formula. First let us prove a lemma of interest in
itself.

Lemma. If y = P(x) is an nth degree polynomial, then its divided
difference of order (n+1) is identically zero; that is,

$$[x, x_0, x_1, \ldots, x_n] = 0$$

for any set of distinct numbers $x, x_0, x_1, \ldots, x_n$.


TABLE 50
DIVIDED DIFFERENCES OF THE FUNCTION y

  x       y         1st      2nd     3rd    4th

  0     132.651
                   81.13
  0.2   148.877             15.8
                   85.87             1
  0.3   157.464             16.2            0
                   89.11             1
  0.4   166.375             16.7            0
                   95.79             1
  0.7   195.112             17.3
                  104.44
  0.9   216.000





Indeed, if P(x) is a polynomial of degree n, then

$$[x, x_0] = \frac{P(x) - P(x_0)}{x - x_0} = P(x, x_0)$$

is a polynomial of degree (n-1) in x. Furthermore,

$$[x, x_0, x_1] = \frac{P(x, x_0) - P(x_0, x_1)}{x - x_1} = P(x, x_0, x_1)$$

is a polynomial of degree (n-2) in x. Indeed, the function
$P(x, x_0) - P(x_0, x_1) = P(x, x_0) - P(x_1, x_0)$ has the root $x = x_1$ and, hence,
on the basis of the remainder theorem, the polynomial
$P(x, x_0) - P(x_0, x_1)$ is exactly divisible by the binomial $x - x_1$. By similar
reasoning we see that

$$[x, x_0, \ldots, x_{n-1}] = P(x, x_0, \ldots, x_{n-1})$$

is a polynomial of degree zero, that is,

$$P(x, x_0, \ldots, x_{n-1}) = C$$

whence

$$[x, x_0, \ldots, x_n] = \frac{C - C}{x - x_n} = 0$$





Now suppose P(x) is a Lagrange polynomial of degree n such
that

$$P(x_i) = f(x_i) = y_i \tag{1}$$

(i = 0, 1, ..., n) where y = f(x) is the given function. Denote by
$P(x, x_0), P(x, x_0, x_1), \ldots, P(x, x_0, \ldots, x_n)$ the successive divided
differences of P(x). We have

$$\left.\begin{aligned}
P(x_0, x_1) &= [x_0, x_1], \\
P(x_0, x_1, x_2) &= [x_0, x_1, x_2], \\
&\;\;\vdots \\
P(x_0, x_1, \ldots, x_n) &= [x_0, x_1, \ldots, x_n]
\end{aligned}\right\} \tag{2}$$

Besides, on the basis of the lemma

$$P(x, x_0, \ldots, x_n) = 0 \tag{3}$$

By definition we obtain

$$\frac{P(x) - P(x_0)}{x - x_0} = P(x, x_0) \tag{4}$$

whence

$$P(x) = P(x_0) + P(x, x_0)(x - x_0) \tag{5}$$

From the definition of divided differences it follows that

$$P(x, x_0, \ldots, x_m) = \frac{P(x, x_0, \ldots, x_{m-1}) - P(x_0, \ldots, x_m)}{x - x_m}$$

whence

$$P(x, x_0, \ldots, x_{m-1}) = P(x_0, \ldots, x_m) + (x - x_m) P(x, x_0, \ldots, x_m) \quad (m = 1, 2, \ldots, n) \tag{6}$$

Using (6), we obtain successively from (5)

$$\begin{aligned}
P(x) &= P(x_0) + P(x, x_0)(x - x_0) = \\
&= P(x_0) + P(x_0, x_1)(x - x_0) + P(x, x_0, x_1)(x - x_0)(x - x_1) = \\
&= P(x_0) + P(x_0, x_1)(x - x_0) + P(x_0, x_1, x_2)(x - x_0)(x - x_1) + \ldots \\
&\quad \ldots + P(x_0, x_1, \ldots, x_n)(x - x_0)(x - x_1) \ldots (x - x_{n-1}) + \\
&\quad + P(x, x_0, \ldots, x_n)(x - x_0)(x - x_1) \ldots (x - x_n)
\end{aligned}$$


or, taking into account the equations (2) and (3), we finally obtain 
Newton’s interpolation formula for unequally spaced values of the 
argument 
$$\begin{aligned}
P(x) = y_0 &+ [x_0, x_1](x - x_0) + [x_0, x_1, x_2](x - x_0)(x - x_1) + \ldots \\
&\ldots + [x_0, x_1, \ldots, x_n](x - x_0)(x - x_1) \ldots (x - x_{n-1})
\end{aligned} \tag{7}$$

The error in formula (7) is, as usual, equal to

$$R_n(x) = f(x) - P(x) = \frac{f^{(n+1)}(\xi)}{(n+1)!}(x - x_0)(x - x_1) \ldots (x - x_n) \tag{8}$$

where $\xi$ is an intermediate value between the points $x_0, x_1, \ldots, x_n$
and x.

Example. Form the interpolation polynomial for the function y = f(x)
given by the table

 x |     0     |  2.5069   |  5.0154   |  7.5270
 y | 0.3989423 | 0.3988169 | 0.3984408 | 0.3978138

Using this polynomial, find f(3.7608).
Solution. We find the divided differences of the function y (Table 51).

TABLE 51
DIVIDED DIFFERENCES OF THE FUNCTION y
(differences in units of the seventh decimal place)

   x          y          1st      2nd     3rd

   0       0.3989423
                         -500
 2.5069    0.3988169             -199
                        -1499              0
 5.0154    0.3984408             -199
                        -2496
 7.5270    0.3978138

Using formula (7), we find

$$y = 0.3989423 - 0.0000500x - 0.0000199x(x - 2.5069)$$

whence

$$y(3.7608) = 0.3989423 - 0.0000500 \cdot 3.7608 - 0.0000199 \cdot 3.7608\,(3.7608 - 2.5069) = 0.3986604$$
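Formula (7) can be evaluated mechanically once the leading divided differences are known. The sketch below is an illustration in Python and is not taken from the original; the names `newton_coefficients` and `newton_eval` are hypothetical. It computes the coefficients $y_0, [x_0, x_1], [x_0, x_1, x_2], \ldots$ in place and evaluates the polynomial by nested multiplication, using the data of the example.

```python
def newton_coefficients(xs, ys):
    """Return y_0, [x_0,x_1], [x_0,x_1,x_2], ... (top edge of the table)."""
    coef = list(ys)
    n = len(xs)
    for order in range(1, n):
        # Work from the bottom up so that lower-order entries still needed
        # are not overwritten too early.
        for i in range(n - 1, order - 1, -1):
            coef[i] = (coef[i] - coef[i - 1]) / (xs[i] - xs[i - order])
    return coef


def newton_eval(xs, coef, x):
    """Evaluate formula (7) by nested multiplication (Horner-like scheme)."""
    result = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        result = result * (x - xs[k]) + coef[k]
    return result


xs = [0.0, 2.5069, 5.0154, 7.5270]
ys = [0.3989423, 0.3988169, 0.3984408, 0.3978138]
coef = newton_coefficients(xs, ys)
value = newton_eval(xs, coef, 3.7608)
print(f"{value:.7f}")
```

Unlike the hand computation, the sketch keeps the unrounded differences, including the very small third-order one, so its result may differ from the value in the text by a unit or two in the seventh decimal place.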


14.20 INVERSE INTERPOLATION FOR THE CASE 
OF EQUALLY SPACED POINTS 


Suppose we have a function y=f(x) given in tabular form. 

The problem of inverse interpolation consists in determining a
value of the argument x from a given value of the function y.

Let us first examine the case of equally spaced points. Here, the 
usual method is that of successive approximations. 

We assume that the function f(x) is monotonic and the given
value of y lies between $y_0 = f(x_0)$ and $y_1 = f(x_1)$.

Replacing the function y by Newton's first interpolation polynomial,
we get

$$y = y_0 + \frac{q}{1!}\Delta y_0 + \frac{q(q-1)}{2!}\Delta^2 y_0 + \ldots + \frac{q(q-1) \ldots (q-n+1)}{n!}\Delta^n y_0$$

whence $q = \varphi(q)$ where

$$\varphi(q) = \frac{y - y_0}{\Delta y_0} - \frac{q(q-1)}{2!}\,\frac{\Delta^2 y_0}{\Delta y_0} - \ldots - \frac{q(q-1) \ldots (q-n+1)}{n!}\,\frac{\Delta^n y_0}{\Delta y_0}$$

For the initial approximation we take

$$q_0 = \frac{y - y_0}{\Delta y_0}$$

Then, applying the method of iteration, we obtain

$$q_m = \varphi(q_{m-1}) \quad (m = 1, 2, \ldots) \tag{1}$$

If $f(x) \in C^{(n+1)}[a, b]$, where the interval [a, b] contains the
interpolation points, and the spacing h is small, then the process
converges:

$$\lim_{m \to \infty} q_m = q$$

where q is the true solution.

Actually, the process of iteration is continued until digits appear
which meet the required accuracy; and one assumes $q \approx q_s$, where
$q_s$ is the last approximation.

Having found q, we determine x from the formula

$$\frac{x - x_0}{h} = q$$

whence

$$x = x_0 + qh$$


Example 1. Using the values of the function $y = \log_{10} x$ given in
the table

 x |   20   |   25   |   30
 y | 1.3010 | 1.3979 | 1.4771

find the value of x such that $y = \log_{10} x = 1.35$.
Solution. Form the difference table.

TABLE 52
FINITE DIFFERENCES OF THE FUNCTION y
(differences in units of the fourth decimal place)

  x      y       Δy     Δ²y

 20    1.3010    969    -177
 25    1.3979    792
 30    1.4771

Assuming $y_0 = 1.3010$, we have

$$q_0 = \frac{y - y_0}{\Delta y_0} = \frac{1.35 - 1.3010}{0.0969} = \frac{490}{969} = 0.506$$

Then, to three decimal places, we successively obtain

$$q_1 = 0.506 - \frac{177}{2 \cdot 969} \cdot 0.506\,(1 - 0.506) = 0.506 - 0.023 = 0.483$$
$$q_2 = 0.506 - \frac{177}{2 \cdot 969} \cdot 0.483\,(1 - 0.483) = 0.506 - 0.023 = 0.483$$

We take

$$q = 0.483$$

whence

$$x = x_0 + qh = 20 + 0.483 \cdot 5 = 22.42$$


By a table of antilogarithms we have x= 22.39. The considera- 
ble divergence between the computed value and the exact value is 
due to the large spacing h=5. 
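The iteration (1) is easily mechanized. What follows is a minimal sketch in Python under stated assumptions: the function name `inverse_newton_forward`, the tolerance and the iteration limit are illustrative choices and do not come from the original text. It forms the leading differences, takes $q_0 = (y - y_0)/\Delta y_0$ as the starting value and repeats $q_m = \varphi(q_{m-1})$ until two successive approximations agree.

```python
def inverse_newton_forward(x0, h, ys, y_target, tol=1e-6, max_iter=50):
    """Solve y(x) = y_target by iterating q = phi(q) on Newton's first formula."""
    # Leading forward differences: diffs[k-1] is the k-th difference at y_0.
    diffs, col = [], list(ys)
    while len(col) > 1:
        col = [b - a for a, b in zip(col, col[1:])]
        diffs.append(col[0])
    d1 = diffs[0]
    q0 = (y_target - ys[0]) / d1

    def phi(q):
        total = q0
        term = q  # running value of q(q-1)...(q-k+1)/k!, starting with k = 1
        for k in range(2, len(diffs) + 1):
            term *= (q - (k - 1)) / k
            total -= term * diffs[k - 1] / d1
        return total

    q = q0
    for _ in range(max_iter):
        q_next = phi(q)
        done = abs(q_next - q) < tol
        q = q_next
        if done:
            break
    return x0 + q * h, q


x, q = inverse_newton_forward(20.0, 5.0, [1.3010, 1.3979, 1.4771], 1.35)
print(round(q, 3), round(x, 2))
```

With unrounded intermediate values the sketch settles on a value of q that agrees with the hand computation to three decimal places; the small difference in the last digit of x comes from rounding q before multiplying by h.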

We applied the iteration method to solve a problem of inverse
interpolation using Newton's first interpolation formula. We could
similarly apply this method to the other interpolation formulas
as well: to Newton's second formula, Stirling's formula, Bessel's
formula, etc. This is illustrated in the following example.


Example 2. Table 53 contains the values of the probability integral [3]

$$y = \frac{2}{\sqrt{\pi}} \int_0^x e^{-x^2}\,dx$$

For what value of x does the integral y equal $\frac{1}{2}$?





TABLE 53
VALUES OF THE PROBABILITY INTEGRAL
(differences in units of the seventh decimal place)

   x        y          Δy      Δ²y     Δ³y    Δ⁴y

  0.45   0.4754818
                      91737
  0.46   0.4846555             -840
                      90897             -11
  0.47   0.4937452             -851             1
                      90046             -10
  0.48   0.5027498             -861             2
                      89185              -8
  0.49   0.5116683             -869
                      88316
  0.50   0.5204999


Solution. We supplement Table 53 with the finite differences of
the function y. The closest tabular value of the argument x
corresponding to the value of the function $y = \frac{1}{2}$ is $x_0 = 0.47$. It is
convenient here to use Bessel's formula.
We have $x_0 = 0.47$, h = 0.01, y = 0.5.

Substituting these values into (8) of Sec. 14.7 and using the
appropriate tabulated values, we obtain

$$0.5 = 0.4982475 + 0.0090046\,p + \frac{p^2 - 0.25}{2}\,(-856) \cdot 10^{-7} + \frac{p\,(p^2 - 0.25)}{6}\,(-10) \cdot 10^{-7} \tag{2}$$

Then, dividing both members of (2) by 0.0090046 and isolating
the term containing p to the first power, we get

$$p = 0.194623 + 4.753 \cdot 10^{-3}\,(p^2 - 0.25) + 1.85 \cdot 10^{-5}\,p\,(p^2 - 0.25) \tag{3}$$

For the first approximation of the parameter p we take

$$p^{(0)} = 0.194623$$

Putting $p^{(0)}$ in (3), we get the second approximation:

$$\begin{aligned}
p^{(1)} &= 0.194623 + 4.753 \cdot 10^{-3}\,[(0.194623)^2 - 0.25] + 1.85 \cdot 10^{-5} \cdot 0.194623 \cdot [(0.194623)^2 - 0.25] = \\
&= 0.194623 - 0.001008 - 0.000001 = 0.193614
\end{aligned}$$

Analogously, putting $p^{(1)}$ in place of p into (3), we get the third
approximation:

$$p^{(2)} = 0.193612$$

Since the first five decimals coincide, the iteration process is
terminated.
Then, we successively find

$$q = p + \frac{1}{2} = 0.693612$$

and

$$x = x_0 + qh = 0.47 + 0.01 \cdot 0.693612 = 0.47693612$$

This value is correct to the sixth decimal place.
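The same iteration can be carried out by machine. The sketch below, in Python, is an illustration and not part of the original text: it solves Bessel's formula (truncated after third differences, as in the example) for the linear term in p and iterates. The names `phi`, `d2_mean` and `d3`, the stopping tolerance, and the check against the standard-library function `math.erf` (which computes the same probability integral) are additions for illustration.

```python
import math

# Rows of Table 53 on either side of y = 0.5 (x_0 = 0.47, h = 0.01).
x0, h = 0.47, 0.01
y0, y1 = 0.4937452, 0.5027498
d1 = y1 - y0                        # first difference, 0.0090046
d2_mean = (-851e-7 + -861e-7) / 2   # mean of the two central second differences
d3 = -10e-7                         # central third difference
target = 0.5


def phi(p):
    """Bessel's formula solved for the term that is linear in p = q - 1/2."""
    rhs = target - (y0 + y1) / 2
    rhs -= (p * p - 0.25) / 2 * d2_mean
    rhs -= p * (p * p - 0.25) / 6 * d3
    return rhs / d1


p = (target - (y0 + y1) / 2) / d1   # first approximation
for _ in range(20):
    p_next = phi(p)
    done = abs(p_next - p) < 1e-9
    p = p_next
    if done:
        break

x = x0 + (p + 0.5) * h
print(round(p, 6), round(x, 6), math.erf(x))
```

The final line prints the value of the integral at the computed x, which should be very close to 0.5 if the iteration has been carried out correctly.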


14.21 INVERSE INTERPOLATION FOR THE CASE 
OF UNEQUALLY SPACED POINTS 


The problem of inverse interpolation of a function for the case
of unequally spaced values of the argument $x_0, x_1, \ldots, x_n$ can be
solved directly by means of Lagrange's interpolation formula. To
do this, it suffices to take the variable y as the independent
variable and to write a formula expressing x as a function of y (Fig. 64):

$$x = \sum_{i=0}^{n} \frac{(y - y_0)(y - y_1) \ldots (y - y_{i-1})(y - y_{i+1}) \ldots (y - y_n)}{(y_i - y_0)(y_i - y_1) \ldots (y_i - y_{i-1})(y_i - y_{i+1}) \ldots (y_i - y_n)}\,x_i \tag{1}$$

[Fig. 64]

where $y_i = f(x_i)$ (i = 0, 1, ..., n). One could also use Newton's
interpolation formula for unequally spaced values of the argument
(see Sec. 14.19) taking y as the argument:

$$\begin{aligned}
x = x_0 &+ [y_0, y_1](y - y_0) + [y_0, y_1, y_2](y - y_0)(y - y_1) + \ldots \\
&\ldots + [y_0, y_1, \ldots, y_n](y - y_0)(y - y_1) \ldots (y - y_{n-1})
\end{aligned} \tag{2}$$

where $[y_0, y_1], [y_0, y_1, y_2], \ldots, [y_0, y_1, \ldots, y_n]$ are the
appropriate divided differences.


Example. Solve Example 2 of Sec. 14.20 with the aid of Lagrange’s 
formula for inverse interpolation. 


Solution. We confine ourselves to four values:

$$x_0 = 0.46, \quad x_1 = 0.47, \quad x_2 = 0.48, \quad x_3 = 0.49$$

Putting

$$u = 10^7 y - 5 \cdot 10^6$$

we have the following table

 x |   0.46   |   0.47  |  0.48  |  0.49
 u | -153 445 | -62 548 | 27 498 | 116 683

The given value $y = \frac{1}{2}$ corresponds to u = 0. Using formula (1),
where y is replaced by u, we obtain

$$\begin{aligned}
x &= \frac{62\,548 \cdot (-27\,498) \cdot (-116\,683)}{(-153\,445 + 62\,548)(-153\,445 - 27\,498)(-153\,445 - 116\,683)} \cdot 0.46 + \\
&+ \frac{153\,445 \cdot (-27\,498) \cdot (-116\,683)}{(-62\,548 + 153\,445)(-62\,548 - 27\,498)(-62\,548 - 116\,683)} \cdot 0.47 + \\
&+ \frac{153\,445 \cdot 62\,548 \cdot (-116\,683)}{(27\,498 + 153\,445)(27\,498 + 62\,548)(27\,498 - 116\,683)} \cdot 0.48 + \\
&+ \frac{153\,445 \cdot 62\,548 \cdot (-27\,498)}{(116\,683 + 153\,445)(116\,683 + 62\,548)(116\,683 - 27\,498)} \cdot 0.49 = \\
&= -0.020779 + 0.157737 + 0.369928 - 0.029950 = 0.476936
\end{aligned}$$
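Lagrange's formula with the roles of x and y interchanged is only a few lines of code. The following Python sketch is an illustration (the function name `lagrange_inverse` is hypothetical); it works directly with the tabulated values of y rather than with the scaled variable u, which changes nothing in the result since the scaling cancels in every ratio.

```python
def lagrange_inverse(xs, ys, y):
    """Evaluate x(y) from formula (1): Lagrange interpolation with y as argument."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, yj in enumerate(ys):
            if j != i:
                weight *= (y - yj) / (yi - yj)
        total += weight * xi
    return total


xs = [0.46, 0.47, 0.48, 0.49]
ys = [0.4846555, 0.4937452, 0.5027498, 0.5116683]
x = lagrange_inverse(xs, ys, 0.5)
print(round(x, 6))
```

The method presupposes that the tabulated values of y are distinct, which is guaranteed when the function is monotonic on the interval in question.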


14.22 FINDING THE ROOTS OF AN EQUATION
BY INVERSE INTERPOLATION

In conclusion, note that the solution of the equation

$$f(x) = 0$$

can be reduced to the problem of inverse interpolation. This
requires forming a table of values of the function y = f(x) and
constructing a corresponding table of finite differences for the
values of x that are close to the root, and then applying the
techniques of inverse interpolation to find the value of x that
corresponds to y = 0.

Example. Using the given table of values of the Bessel function
$y = J_0(x)$

 x |  2.4   |   2.5   |   2.6
 y | 0.0025 | -0.0484 | -0.0968

find the root of the equation $J_0(x) = 0$ lying in the interval (2.4,
2.6) to within $10^{-3}$.
Solution. Form the difference table (Table 54).

TABLE 54
FINITE DIFFERENCES OF THE BESSEL FUNCTION $y = J_0(x)$
(differences in units of the fourth decimal place)

  x       y        Δy     Δ²y

 2.4    0.0025    -509    25
 2.5   -0.0484    -484
 2.6   -0.0968

Putting y = 0 and $x_0 = 2.4$, $y_0 = 0.0025$, we get, on the basis of
formula (1) of Sec. 14.20,

$$q_0 = \frac{y - y_0}{\Delta y_0} = \frac{-0.0025}{-0.0509} = 0.049,$$

$$q_1 = q_0 - \frac{q_0(q_0 - 1)}{2}\,\frac{\Delta^2 y_0}{\Delta y_0} = 0.049 - \frac{25}{2 \cdot 509} \cdot 0.049 \cdot 0.951 = 0.049 - 0.001 = 0.048,$$

$$q_2 = 0.049 - \frac{25}{2 \cdot 509} \cdot 0.048 \cdot 0.952 = 0.049 - 0.001 = 0.048$$

We take

$$q = 0.048$$

whence

$$x = x_0 + qh = 2.4 + 0.048 \cdot 0.1 = 2.405$$

From the tables,

$$x = 2.4048$$


14.23 THE INTERPOLATION METHOD FOR EXPANDING 
A SECULAR DETERMINANT 
Interpolation of functions may be used for expanding a secular
(characteristic) determinant (see Chapter 12)

$$D(\lambda) = \det(A - \lambda E)$$

where $A = [a_{ij}]$.
Choose equally spaced points

$$\lambda_0 = 0, \quad \lambda_1 = 1, \quad \ldots, \quad \lambda_n = n$$

and for the determinant $D(\lambda)$ compute the corresponding values

$$D(0) = D_0, \quad D(1) = D_1, \quad \ldots, \quad D(n) = D_n$$

Forming a horizontal difference table for the sequence of numbers
D(0), D(1), ..., D(n), we find the differences $\Delta^i D(0)$ (i = 0, 1,
..., n) in the usual manner. From this, using Newton's first
interpolation formula, we find the polynomial expression for the
secular determinant

$$D(\lambda) = D(0) + \sum_{i=1}^{n} \frac{\Delta^i D(0)}{i!}\,\lambda(\lambda - 1) \ldots (\lambda - i + 1) \tag{1}$$

If we put

$$\frac{\lambda(\lambda - 1) \ldots (\lambda - i + 1)}{i!} = \sum_{m=1}^{i} c_{mi}\lambda^m \quad (i = 1, 2, \ldots, n) \tag{2}$$

then a few simple manipulations yield A. A. Markov's formula:

$$D(\lambda) = D(0) + \sum_{m=1}^{n} \lambda^m \sum_{i=m}^{n} c_{mi}\,\Delta^i D(0) \tag{3}$$

Tables of coefficients $c_{mi}$ [8] have been compiled to simplify
computations by formula (2).

In the more general case, if we take the numbers $\lambda_i = a + ih$
(i = 0, 1, ..., n) for the interpolation points, then formula (3)
becomes

$$D(\lambda) = D(a) + \sum_{m=1}^{n} (\lambda - a)^m \sum_{i=m}^{n} c_{mi}\,h^{-m}\,\Delta^i D(a) \tag{4}$$

Although the method of interpolation proposed here requires
laborious computations of n + 1 determinants of order n, it is
nevertheless convenient due to its simple computational scheme.
What is more, it is applicable to the expansion of a more general
type of determinant:

$$F(\lambda) = \det[f_{ij}(\lambda)]$$

where $f_{ij}(\lambda)$ are integral polynomials in $\lambda$.
Example. Using the method of interpolation, expand the characteristic
determinant

$$D(\lambda) = \begin{vmatrix}
1 - \lambda & 2 & 3 & 4 \\
2 & 1 - \lambda & 2 & 3 \\
3 & 2 & 1 - \lambda & 2 \\
4 & 3 & 2 & 1 - \lambda
\end{vmatrix}$$

(cf. Sec. 12.3, Example).

Solution. We successively compute D(i) for i = 0, 1, 2, 3, 4 to get

$$D(0) = -20, \quad D(1) = -119, \quad D(2) = -308, \quad D(3) = -575, \quad D(4) = -884$$

The finite differences $\Delta^i D(0)$ (i = 0, 1, 2, 3, 4) are given in
Table 55.


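TABLE 55
FINITE DIFFERENCES OF THE NUMBERS D(λ)

  λ    D(λ)    ΔD(λ)   Δ²D(λ)   Δ³D(λ)   Δ⁴D(λ)

  0    -20     -99     -90      12       24
  1    -119    -189    -78      36
  2    -308    -267    -42
  3    -575    -309
  4    -884

Since

$$\begin{aligned}
\frac{\lambda}{1!} &= \lambda, \\
\frac{\lambda(\lambda - 1)}{2!} &= \frac{\lambda^2}{2} - \frac{\lambda}{2}, \\
\frac{\lambda(\lambda - 1)(\lambda - 2)}{3!} &= \frac{\lambda^3}{6} - \frac{\lambda^2}{2} + \frac{\lambda}{3}, \\
\frac{\lambda(\lambda - 1)(\lambda - 2)(\lambda - 3)}{4!} &= \frac{\lambda^4}{24} - \frac{\lambda^3}{4} + \frac{11\lambda^2}{24} - \frac{\lambda}{4}
\end{aligned}$$

it follows from formula (2) that

$$\begin{aligned}
&c_{11} = 1; \\
&c_{12} = -\tfrac{1}{2}, \quad c_{22} = \tfrac{1}{2}; \\
&c_{13} = \tfrac{1}{3}, \quad c_{23} = -\tfrac{1}{2}, \quad c_{33} = \tfrac{1}{6}; \\
&c_{14} = -\tfrac{1}{4}, \quad c_{24} = \tfrac{11}{24}, \quad c_{34} = -\tfrac{1}{4}, \quad c_{44} = \tfrac{1}{24}
\end{aligned}$$

whence, using the Markov formula (3), we get

$$\begin{aligned}
D(\lambda) &= D(0) + [c_{11}\Delta D(0) + c_{12}\Delta^2 D(0) + c_{13}\Delta^3 D(0) + c_{14}\Delta^4 D(0)]\,\lambda + \\
&\quad + [c_{22}\Delta^2 D(0) + c_{23}\Delta^3 D(0) + c_{24}\Delta^4 D(0)]\,\lambda^2 + \\
&\quad + [c_{33}\Delta^3 D(0) + c_{34}\Delta^4 D(0)]\,\lambda^3 + c_{44}\Delta^4 D(0)\,\lambda^4 = \\
&= -20 + \left(-99 \cdot 1 + 90 \cdot \tfrac{1}{2} + 12 \cdot \tfrac{1}{3} - 24 \cdot \tfrac{1}{4}\right)\lambda + \left(-90 \cdot \tfrac{1}{2} - 12 \cdot \tfrac{1}{2} + 24 \cdot \tfrac{11}{24}\right)\lambda^2 + \\
&\quad + \left(12 \cdot \tfrac{1}{6} - 24 \cdot \tfrac{1}{4}\right)\lambda^3 + 24 \cdot \tfrac{1}{24}\,\lambda^4 = -20 - 56\lambda - 40\lambda^2 - 4\lambda^3 + \lambda^4
\end{aligned}$$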

*14.24 INTERPOLATION OF FUNCTIONS OF TWO VARIABLES
Let the function

$$z = f(x, y)$$

be specified on a set of equally spaced points $(x_i, y_j)$
(i, j = 0, 1, 2, ...), where

$$x_i = x_0 + ih, \quad y_j = y_0 + jk$$

and

$$h = \Delta x_i = \text{constant}, \quad k = \Delta y_j = \text{constant}$$

For brevity we introduce the notation

$$z_{ij} = f(x_i, y_j)$$




The values of the function z may be arranged in a double-entry 
table (Table 56). 


TABLE 56 
THE VALUES OF A FUNCTION OF TWO VARIABLES 





Interpolation of a function of two variables

$$z = f(x, y)$$

that is, the finding of nontabulated values, may be performed
successively with respect to each variable x and y separately. For
example, let it be required to find the value

$$\bar{z} = f(\bar{x}, \bar{y})$$

Interpolating in appropriate fashion the chosen functions of one
variable x,

$$f_s(x) = f(x, y_s)$$

where $y_s \approx \bar{y}$, we find the values $f_s(\bar{x})$. For this purpose one uses
the corresponding rows of the double-entry table. Regarding the
values $f_s(\bar{x}) = f(\bar{x}, y_s)$ thus obtained as the values of the function
$f(\bar{x}, y)$ of one variable y, we find the desired value $f(\bar{x}, \bar{y}) = \bar{z}$
by means of one of the interpolation formulas.

The interpolation can also be carried out in the reverse order. 


Example. The values of the “aftereffect” function 


+o 


: f(x, y)= f en 8z -z-xe—? dz 


-%0 


are given in the following table (see Jahnke and Emde's
Funktionentafeln):

  y \ x |  0.4  |  0.7  |  1.0
  0     | 2.500 | 1.429 | 1.000
  0.05  | 2.487 | 1.419 | 0.995
  0.10  | 2.456 | 1.400 | 0.981

Find f(0.5, 0.03).


Solution. Form the tables 57a, 57b and 57c using the rows of
the given double-entry table.

TABLE 57a (y = 0)

  x      f        Δf       Δ²f
 0.4   2.500   -1.071    0.642
 0.7   1.429   -0.429
 1.0   1.000

TABLE 57b (y = 0.05)

  x      f        Δf       Δ²f
 0.4   2.487   -1.068    0.644
 0.7   1.419   -0.424
 1.0   0.995

TABLE 57c (y = 0.10)

  x      f        Δf       Δ²f
 0.4   2.456   -1.056    0.637
 0.7   1.400   -0.419
 1.0   0.981

Since for these tables

$$h = 0.7 - 0.4 = 0.3$$

then, putting $x_0 = 0.4$, we have

$$q = \frac{\bar{x} - x_0}{h} = \frac{0.5 - 0.4}{0.3} = \frac{1}{3}$$

whence, using Newton's first interpolation formula, we successively
obtain

$$f_0 = f(0.5, 0) = 2.500 - \frac{1}{3} \cdot 1.071 + \frac{\frac{1}{3}\left(-\frac{2}{3}\right)}{2} \cdot 0.642 = 2.072,$$
$$f_1 = f(0.5, 0.05) = 2.487 - \frac{1}{3} \cdot 1.068 - \frac{1}{9} \cdot 0.644 = 2.069,$$
$$f_2 = f(0.5, 0.10) = 2.456 - \frac{1}{3} \cdot 1.056 - \frac{1}{9} \cdot 0.637 = 2.033$$

We now form a table of the values thus found (Table 58).

TABLE 58

  y       f        Δf       Δ²f
  0     2.072   -0.003   -0.033
  0.05  2.069   -0.036
  0.10  2.033

Assuming k = 0.05 - 0 = 0.05 and $y_0 = 0$, we get

$$q' = \frac{0.03 - 0}{0.05} = \frac{3}{5}$$

whence

$$f(0.5, 0.03) = 2.072 - \frac{3}{5} \cdot 0.003 + \frac{\frac{3}{5}\left(-\frac{2}{5}\right)}{2} \cdot (-0.033) = 2.074$$
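Successive interpolation, first along each row in x and then along the resulting column in y, can be sketched in Python as follows. This is an illustration, not the authors' program; the helper name `newton_forward` is hypothetical, and the third column of the data (x = 1.0) is taken as the values implied by the first differences listed in Tables 57a to 57c, which is an assumption about the original table.

```python
def newton_forward(values, q):
    """Newton's first interpolation formula; q = (x - x_0)/h."""
    diffs, col = [], list(values)
    while col:
        diffs.append(col[0])                       # y_0, Δy_0, Δ²y_0, ...
        col = [b - a for a, b in zip(col, col[1:])]
    total, term = 0.0, 1.0
    for k, d in enumerate(diffs):
        total += term * d
        term *= (q - k) / (k + 1)                  # next binomial coefficient
    return total


x_nodes = [0.4, 0.7, 1.0]
y_nodes = [0.0, 0.05, 0.10]
rows = [
    [2.500, 1.429, 1.000],   # y = 0
    [2.487, 1.419, 0.995],   # y = 0.05
    [2.456, 1.400, 0.981],   # y = 0.10
]

qx = (0.5 - x_nodes[0]) / (x_nodes[1] - x_nodes[0])
column = [newton_forward(row, qx) for row in rows]   # f(0.5, y_s)
qy = (0.03 - y_nodes[0]) / (y_nodes[1] - y_nodes[0])
result = newton_forward(column, qy)
print([round(v, 3) for v in column], round(result, 3))
```

Evaluated this way, the intermediate value for the row y = 0.05 comes out near 2.059 rather than the printed 2.069, and the final value near 2.066; the printed figures therefore appear to contain an arithmetic slip in the second row, which is worth checking by hand.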


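*14.25 DOUBLE DIFFERENCES OF HIGHER ORDER

For a function z = f(x, y) given by a double-entry table $\{z_{ij}\}$
we can define the partial finite differences

$$\Delta_x z_{ij} = z_{i+1,\,j} - z_{ij} \quad \text{and} \quad \Delta_y z_{ij} = z_{i,\,j+1} - z_{ij}$$

Repeating these operations, we obtain double differences of higher
orders:

$$\Delta^{m+n} z_{ij} = \Delta_{x^m y^n} z_{ij} = \Delta_{x^m}(\Delta_{y^n} z_{ij}) = \Delta_{y^n}(\Delta_{x^m} z_{ij}),$$

where we have set $\Delta^{0+0} z_{ij} = z_{ij}$. For instance,

$$\begin{aligned}
\Delta^{1+2} z_{ij} &= \Delta_x(\Delta_{yy} z_{ij}) = \Delta_x(z_{i,\,j+2} - 2z_{i,\,j+1} + z_{ij}) = \\
&= (z_{i+1,\,j+2} - 2z_{i+1,\,j+1} + z_{i+1,\,j}) - (z_{i,\,j+2} - 2z_{i,\,j+1} + z_{ij})
\end{aligned}$$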


*14.26 NEWTON'S INTERPOLATION FORMULA
FOR A FUNCTION OF TWO VARIABLES
Using the differences of a function of two variables z = f(x, y),
we can construct an interpolation polynomial similar to Newton's
interpolation polynomial. Let P(x, y) be an integral polynomial
such that

$$\Delta_{x^m y^n} P(x_0, y_0) = \Delta^{m+n} z_{00} \tag{1}$$

(m, n = 0, 1, 2, ...). Suppose that P(x, y) is expanded in terms
of the generalized powers of the differences $x - x_0$ and $y - y_0$;
that is,

$$\begin{aligned}
P(x, y) = c_{00} &+ c_{10}(x - x_0) + c_{01}(y - y_0) + c_{20}(x - x_0)(x - x_1) + \\
&+ c_{11}(x - x_0)(y - y_0) + c_{02}(y - y_0)(y - y_1) + \ldots
\end{aligned} \tag{2}$$

Putting $x = x_0$ and $y = y_0$, we have, by virtue of (1),

$$P(x_0, y_0) = z_{00} = c_{00}$$

We form first differences for P(x, y) to get

$$\Delta_x P(x, y) = c_{10}h + 2c_{20}h(x - x_0) + c_{11}h(y - y_0) + \ldots$$

and

$$\Delta_y P(x, y) = c_{01}k + c_{11}k(x - x_0) + 2c_{02}k(y - y_0) + \ldots$$

whence, putting $x = x_0$ and $y = y_0$, we have, on the basis of (1),

$$\Delta_x P(x_0, y_0) = \Delta^{1+0} z_{00} = c_{10}h$$

and

$$\Delta_y P(x_0, y_0) = \Delta^{0+1} z_{00} = c_{01}k$$

That is,

$$c_{10} = \frac{\Delta^{1+0} z_{00}}{h}, \quad c_{01} = \frac{\Delta^{0+1} z_{00}}{k}$$

Then, computing the finite differences of second order for the
polynomial P(x, y), we find

$$\Delta_{xx} P(x, y) = 2!\,c_{20}h^2 + \ldots, \quad \Delta_{xy} P(x, y) = c_{11}hk + \ldots, \quad \Delta_{yy} P(x, y) = 2!\,c_{02}k^2 + \ldots$$

From this we obtain, for $x = x_0$ and $y = y_0$,

$$\Delta_{xx} P(x_0, y_0) = \Delta^{2+0} z_{00} = 2!\,c_{20}h^2,$$
$$\Delta_{xy} P(x_0, y_0) = \Delta^{1+1} z_{00} = c_{11}hk,$$
$$\Delta_{yy} P(x_0, y_0) = \Delta^{0+2} z_{00} = 2!\,c_{02}k^2$$

Thus

$$c_{20} = \frac{1}{2!}\,\frac{\Delta^{2+0} z_{00}}{h^2}, \quad c_{11} = \frac{\Delta^{1+1} z_{00}}{hk}, \quad c_{02} = \frac{1}{2!}\,\frac{\Delta^{0+2} z_{00}}{k^2}$$

The subsequent coefficients of the expansion (2) are found in a
similar manner. Substituting the values of the coefficients thus
found into formula (2), we get an interpolation polynomial for a
function of two variables:

$$\begin{aligned}
P(x, y) = z_{00} &+ \left[\frac{\Delta^{1+0} z_{00}}{h}(x - x_0) + \frac{\Delta^{0+1} z_{00}}{k}(y - y_0)\right] + \\
&+ \frac{1}{2!}\left[\frac{\Delta^{2+0} z_{00}}{h^2}(x - x_0)(x - x_1) + \frac{2\Delta^{1+1} z_{00}}{hk}(x - x_0)(y - y_0) + \frac{\Delta^{0+2} z_{00}}{k^2}(y - y_0)(y - y_1)\right] + \ldots
\end{aligned} \tag{3}$$

When interpolating the function f(x, y) we assume

$$f(x, y) \approx P(x, y)$$

To simplify computations, it is common to introduce the variables

$$\frac{x - x_0}{h} = p, \quad \frac{y - y_0}{k} = q$$

Then

$$\frac{x - x_1}{h} = p - 1, \quad \frac{y - y_1}{k} = q - 1$$

and so on. Then formula (3) becomes

$$z \approx z_{00} + (p\,\Delta^{1+0} z_{00} + q\,\Delta^{0+1} z_{00}) + \frac{1}{2!}\left[p(p - 1)\Delta^{2+0} z_{00} + 2pq\,\Delta^{1+1} z_{00} + q(q - 1)\Delta^{0+2} z_{00}\right] + \ldots \tag{4}$$

where

$$x = x_0 + ph, \quad y = y_0 + qk$$

If one puts p = 0 or q = 0, then (4) becomes the corresponding
interpolation formula of Newton.
Example. Using the interpolation formula (4), find f = f(0.5, 0.03)
for the function f(x, y) considered in the example of Sec. 14.24.

Solution. Taking $x_0 = 0.4$, $y_0 = 0$, form tables of first differences
(Tables 59a and 59b) for f.

TABLE 59a

          Δ^{1+0} f_{0j}    Δ^{1+0} f_{1j}
 j = 0       -1.071            -0.429
 j = 1       -1.068            -0.424
 j = 2       -1.056            -0.419

TABLE 59b

                   i = 0     i = 1     i = 2
 Δ^{0+1} f_{i0}   -0.013    -0.010    -0.005
 Δ^{0+1} f_{i1}   -0.031    -0.019    -0.014

From this we find the second differences

$$\Delta^{2+0} f_{00} = \Delta^{1+0} f_{10} - \Delta^{1+0} f_{00} = -0.429 - (-1.071) = 0.642,$$
$$\Delta^{1+1} f_{00} = \Delta^{1+0} f_{01} - \Delta^{1+0} f_{00} = -1.068 - (-1.071) = 0.003$$

or

$$\Delta^{1+1} f_{00} = \Delta^{0+1} f_{10} - \Delta^{0+1} f_{00} = -0.010 - (-0.013) = 0.003,$$
$$\Delta^{0+2} f_{00} = \Delta^{0+1} f_{01} - \Delta^{0+1} f_{00} = -0.031 - (-0.013) = -0.018$$

Since

$$p = \frac{x - x_0}{h} = \frac{0.5 - 0.4}{0.3} = \frac{1}{3}, \quad q = \frac{y - y_0}{k} = \frac{0.03 - 0}{0.05} = \frac{3}{5}$$

then, using formula (4), we get

$$\begin{aligned}
f &= 2.500 + \frac{1}{3} \cdot (-1.071) + \frac{3}{5} \cdot (-0.013) + \frac{1}{2}\left[\frac{1}{3} \cdot \left(-\frac{2}{3}\right) \cdot 0.642 + 2 \cdot \frac{1}{3} \cdot \frac{3}{5} \cdot 0.003 + \frac{3}{5} \cdot \left(-\frac{2}{5}\right)(-0.018)\right] = \\
&= 2.500 - 0.357 - 0.0078 - 0.0713 + 0.0006 + 0.0021 = 2.067
\end{aligned}$$

Comparing this with the answer f = 2.074 obtained by the first
method, we see that the thousandths digits are unreliable.
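Formula (4), truncated after the second-order terms, can be written as a short function. The Python sketch below is illustrative only; the name `newton_two_variable` and the indexing convention `z[i][j] = f(x_i, y_j)` are assumptions made for the example, and the values for x = 1.0 are those implied by the first differences in Tables 59a and 59b.

```python
def newton_two_variable(z, p, q):
    """Formula (4) up to second differences; z[i][j] = f(x_i, y_j)."""
    d10 = z[1][0] - z[0][0]                          # Δ^{1+0} z_00
    d01 = z[0][1] - z[0][0]                          # Δ^{0+1} z_00
    d20 = z[2][0] - 2 * z[1][0] + z[0][0]            # Δ^{2+0} z_00
    d11 = z[1][1] - z[1][0] - z[0][1] + z[0][0]      # Δ^{1+1} z_00
    d02 = z[0][2] - 2 * z[0][1] + z[0][0]            # Δ^{0+2} z_00
    linear = p * d10 + q * d01
    quadratic = p * (p - 1) * d20 + 2 * p * q * d11 + q * (q - 1) * d02
    return z[0][0] + linear + quadratic / 2


z = [
    [2.500, 2.487, 2.456],   # x = 0.4; y = 0, 0.05, 0.10
    [1.429, 1.419, 1.400],   # x = 0.7
    [1.000, 0.995, 0.981],   # x = 1.0
]
value = newton_two_variable(z, p=(0.5 - 0.4) / 0.3, q=0.03 / 0.05)
print(round(value, 3))
```

Only the entries z[0][0], z[1][0], z[2][0], z[0][1], z[0][2] and z[1][1] enter the second-order formula, so the corner values of the table are not used.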


REFERENCES FOR CHAPTER 14 


[1] E. T. Whittaker and G. Robinson, The Calculus of Observations, 1944,
Chapter I.

[2] V. L. Goncharov, The Theory of Interpolation and Approximation of
Functions, 1934, Chapter I, Secs. 18-21 (in Russian).

[3] James B. Scarborough, Numerical Mathematical Analysis, 1955, Chapter IV,
Part II.

[4] V. M. Bradis, The Theory and Practice of Computations, 1935, Chapter IX
(in Russian).

[5] E. Milne, Numerical Calculus, 1949, Chapters III and VI.

[6] E. Ya. Remez, General Computational Methods of Chebyshev Approximation,
1957, Part 1, Chapter I (in Russian).

[7] N. A. Lednev (editor), Mathematical Practice Session Devoted to Computing
Instruments and Machines, 1959, Chapter II (in Russian).

[8] V. N. Faddeyeva, Computational Methods of Linear Algebra, 1950,
Chapter III, Sec. 27 (in Russian).


Chapter 15 
APPROXIMATE DIFFERENTIATION 


15.1 STATEMENT OF THE PROBLEM


In the solution of practical problems, one often has to find
derivatives of indicated orders of a function y = f(x) given in
tabular form. It is also possible that, due to the complexity of the
analytic expression of the function f(x), direct differentiation
would be involved. In such cases, one usually resorts to
approximate differentiation.

To derive formulas for approximate differentiation we replace
the given function f(x), on the interval [a, b] that interests us,
by an interpolating function P(x) (mostly by a polynomial) and
then set

$$f'(x) = P'(x) \tag{1}$$

for $a \le x \le b$.

We do the same when seeking higher-order derivatives of the
function f(x).
If we know the error

$$R(x) = f(x) - P(x)$$

in the interpolating function P(x), then the error of the derivative
P'(x) is given by the formula

$$r(x) = f'(x) - P'(x) = R'(x) \tag{2}$$

which means that the error of the derivative of an interpolating
function is equal to the derivative of the error in that function.
The same holds true for higher-order derivatives as well.

It is well to note that, generally speaking, approximate diffe- 
rentiation is a less exact operation than interpolation. Indeed, 
the closeness of the ordinates of two curves 


y=f(x) and Y=P (x) 


on the interval [a, b] does not yet guarantee closeness of their 
derivatives f'(x) and P'(x) on that interval; that is, it does not
guarantee a small divergence of the slopes of the tangents to the
curves at hand, given the same values of the argument (Fig. 65).

[Fig. 65]


15.2 FORMULAS OF APPROXIMATE DIFFERENTIATION BASED 
ON NEWTON'S FIRST INTERPOLATION FORMULA 


Suppose we have a function y(x) specified at equally spaced
points $x_i$ (i = 0, 1, 2, ..., n) of an interval [a, b] by means of
the values $y_i = f(x_i)$. In order to find on [a, b] the derivatives
y' = f'(x), y'' = f''(x), etc.,¹⁾ we replace the function y by Newton's
interpolation polynomial constructed for a set of points $x_0, x_1, \ldots, x_k$
($k \le n$).

¹⁾ Quite naturally, it must be known beforehand that the appropriate
derivatives of f(x) exist, otherwise the computations will be illusory.

We have

$$y(x) = y_0 + q\,\Delta y_0 + \frac{q(q-1)}{2!}\Delta^2 y_0 + \frac{q(q-1)(q-2)}{3!}\Delta^3 y_0 + \frac{q(q-1)(q-2)(q-3)}{4!}\Delta^4 y_0 + \ldots \tag{1}$$

where

$$q = \frac{x - x_0}{h} \quad \text{and} \quad h = x_{i+1} - x_i \quad (i = 0, 1, \ldots)$$

Multiplying the binomials together, we get

$$y(x) = y_0 + q\,\Delta y_0 + \frac{q^2 - q}{2}\Delta^2 y_0 + \frac{q^3 - 3q^2 + 2q}{6}\Delta^3 y_0 + \frac{q^4 - 6q^3 + 11q^2 - 6q}{24}\Delta^4 y_0 + \ldots \tag{1'}$$

Since

$$\frac{dy}{dx} = \frac{dy}{dq} \cdot \frac{dq}{dx} = \frac{1}{h}\,\frac{dy}{dq}$$

it follows that

$$y'(x) = \frac{1}{h}\left[\Delta y_0 + \frac{2q - 1}{2}\Delta^2 y_0 + \frac{3q^2 - 6q + 2}{6}\Delta^3 y_0 + \frac{2q^3 - 9q^2 + 11q - 3}{12}\Delta^4 y_0 + \ldots\right] \tag{2}$$


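Similarly, since

$$y''(x) = \frac{d(y')}{dx} = \frac{d(y')}{dq} \cdot \frac{dq}{dx}$$

it follows that

$$y''(x) = \frac{1}{h^2}\left[\Delta^2 y_0 + (q - 1)\Delta^3 y_0 + \frac{6q^2 - 18q + 11}{12}\Delta^4 y_0 + \ldots\right] \tag{3}$$

If necessary, the same method may be used to compute the
derivatives of any order of the function y(x).

Note that when seeking the derivatives y'(x), y''(x), ... at a
fixed point x, one should choose for $x_0$ the closest tabular value
of the argument.

It is sometimes required to find the derivatives of y at basic
tabulated points $x_i$. In this case the formulas of numerical
differentiation are simplified. Since each tabular value may be taken
as the initial value, we put $x = x_0$, q = 0; then we have

$$y'(x_0) = \frac{1}{h}\left(\Delta y_0 - \frac{\Delta^2 y_0}{2} + \frac{\Delta^3 y_0}{3} - \frac{\Delta^4 y_0}{4} + \frac{\Delta^5 y_0}{5} - \ldots\right) \tag{4}$$

and

$$y''(x_0) = \frac{1}{h^2}\left(\Delta^2 y_0 - \Delta^3 y_0 + \frac{11}{12}\Delta^4 y_0 - \frac{5}{6}\Delta^5 y_0 + \ldots\right) \tag{5}$$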
If $P_k(x)$ is Newton's interpolation polynomial containing the
differences $\Delta y_0, \Delta^2 y_0, \ldots, \Delta^k y_0$ and

$$R_k(x) = y(x) - P_k(x)$$

is the corresponding error, then the error in determining the
derivative is

$$R'_k(x) = y'(x) - P'_k(x)$$

As we know (Sec. 14.15),

$$R_k(x) = \frac{(x - x_0)(x - x_1) \ldots (x - x_k)}{(k+1)!}\,y^{(k+1)}(\xi) = h^{k+1}\,\frac{q(q-1) \ldots (q-k)}{(k+1)!}\,y^{(k+1)}(\xi)$$

where $\xi$ is an intermediate number between the values $x_0, x_1, \ldots, x_k, x$.
For this reason, assuming that $y(x) \in C^{(k+2)}$, we get

$$R'_k(x) = \frac{dR_k}{dq} \cdot \frac{dq}{dx} = \frac{h^k}{(k+1)!}\left\{y^{(k+1)}(\xi)\,\frac{d}{dq}[q(q-1) \ldots (q-k)] + q(q-1) \ldots (q-k)\,\frac{d}{dq}[y^{(k+1)}(\xi)]\right\}$$

Furthermore, if we suppose $\frac{d}{dq}[y^{(k+1)}(\xi)]$ to be bounded and take
into account that $\frac{d}{dq}[q(q-1) \ldots (q-k)]_{q=0} = (-1)^k k!$, then for
$x = x_0$ and, hence, for q = 0, we will have

$$R'_k(x_0) = (-1)^k\,\frac{h^k}{k+1}\,y^{(k+1)}(\xi) \tag{6}$$

Since in many cases it is difficult to estimate $y^{(k+1)}(\xi)$, for small
h we set

$$y^{(k+1)}(\xi) \approx \frac{\Delta^{k+1} y_0}{h^{k+1}}$$

and, hence,

$$R'_k(x_0) \approx \frac{(-1)^k}{h}\,\frac{\Delta^{k+1} y_0}{k+1} \tag{7}$$

In similar fashion we can find the error $R''_k(x_0)$ for the second
derivative y''(x).

Example 1. Find y'(50) of the tabulated function $y = \log_{10} x$
(Table 60).

TABLE 60
VALUES OF THE FUNCTION $y = \log_{10} x$

Solution. Here h = 5. We supplement Table 60 with columns of
finite differences (as usual, decimal orders are not indicated; they
are determined by the decimal orders of the values of the function).

Using the first row of the table, on the basis of formula (4),
we get, to within third differences,

$$y'(50) = \frac{1}{5}\,(0.0414 + 0.0018 + 0.0002) = 0.0087$$

To estimate the accuracy of this value, note that since the
above-tabulated function is $y = \log_{10} x$,

$$y' = \frac{M}{x} = \frac{0.43429}{x}$$

Hence

$$y'(50) = \frac{0.43429}{50} = 0.0087$$

Thus, the results coincide to the fourth decimal place.


Example 2. The path y = f(t) traversed in time t by a point
moving in a straight line is given by the following table [1]:

  i    Time t_i in sec    Path y(t_i) in cm

  0        0.00               0.000
  1        0.01               1.519
  2        0.02               6.031
  3        0.03              13.397
  4        0.04              23.396
  5        0.05              35.721
  6        0.06              50.000
  7        0.07              65.798
  8        0.08              82.635
  9        0.09             100.000

Using finite differences up to order five inclusive, find the
approximate velocity $v = \frac{dy}{dt}$ and the acceleration $w = \frac{d^2y}{dt^2}$ of the
point for times t = 0, 0.01, 0.02, 0.03, 0.04.


Solution. Form the difference table (Table 61).

Setting h = 0.01 and applying formulas (4) and (5), we get
approximate values of the velocity V (cm/sec) and acceleration
W (cm/sec²). For example,

$$V(0) = 100\,(1.519 - 1.496 - 0.046 + 0.020 - 0.001) = -0.4 \text{ cm/sec}$$
$$W(0) = 10{,}000\,(2.993 + 0.139 - 0.075 + 0.003) = 30{,}600 \text{ cm/sec}^2$$

The values of V and W are given in Table 62.

TABLE 61
FINITE DIFFERENCES OF THE FUNCTION y = f(t)

  i     Δy_i     Δ²y_i    Δ³y_i    Δ⁴y_i    Δ⁵y_i

  0     1.519    2.993   -0.139   -0.082   -0.004
  1     4.512    2.854   -0.221   -0.086    0.021
  2     7.366    2.633   -0.307   -0.065    0.002
  3     9.999    2.326   -0.372   -0.063    0.018
  4    12.325    1.954   -0.435   -0.045    0.014
  5    14.279    1.519   -0.480   -0.031     —
  6    15.798    1.039   -0.511     —
  7    16.837    0.528     —
  8    17.365     —

TABLE 62
VALUES OF VELOCITY V AND ACCELERATION W
FOR THE LAW OF MOTION y = f(t)

  t      W (approximate)    W (exact)

 0.00       30,600           30,462
 0.01       29,780           30,001
 0.02       28,780           28,625
 0.03       26,250           26,381
 0.04       23,360           23,340

It will be noted that the tabulated law of motion is given by
the formula

$$y = 100\left(1 - \cos\frac{50\pi t}{9}\right)$$

whence

$$V = \frac{dy}{dt} = \frac{5000\pi}{9}\,\sin\frac{50\pi t}{9}$$

and

$$W = \frac{d^2y}{dt^2} = \frac{250\,000\pi^2}{81}\,\cos\frac{50\pi t}{9}$$

By way of comparison, the exact values, V and W, are given on
the right-hand side of Table 62.
It may be noted that it is also possible to derive formulas for
approximate differentiation by proceeding from Newton's second
interpolation formula.
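Formulas (4) and (5) translate directly into code. The Python sketch below is an illustration and not the authors' procedure; the function names are hypothetical. It forms the leading differences of six consecutive tabular values and applies both formulas through fifth differences to the path table of Example 2.

```python
def forward_differences(values):
    """Leading differences Δy_0, Δ²y_0, ... of a column of tabular values."""
    diffs, col = [], list(values)
    while len(col) > 1:
        col = [b - a for a, b in zip(col, col[1:])]
        diffs.append(col[0])
    return diffs


def derivatives_at_node(values, h):
    """Formulas (4) and (5), using differences up to the fifth order."""
    d = forward_differences(values)[:5]
    d += [0.0] * (5 - len(d))          # missing higher differences count as zero
    first = (d[0] - d[1] / 2 + d[2] / 3 - d[3] / 4 + d[4] / 5) / h
    second = (d[1] - d[2] + 11 / 12 * d[3] - 5 / 6 * d[4]) / h**2
    return first, second


path = [0.000, 1.519, 6.031, 13.397, 23.396,
        35.721, 50.000, 65.798, 82.635, 100.000]
h = 0.01
for i in range(5):
    v, w = derivatives_at_node(path[i:i + 6], h)
    print(f"t = {i * h:.2f}  V = {v:8.1f}  W = {w:9.0f}")

v0, w0 = derivatives_at_node(path[:6], h)
```

Because the tabular values are rounded to three decimals, the velocity at t = 0 comes out slightly negative instead of zero, just as in the hand computation; division by the small step h magnifies the rounding errors of the table.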




15.3 FORMULAS OF APPROXIMATE DIFFERENTIATION 
BASED ON STIRLING’S FORMULA 


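The formulas of numerical differentiation derived in Sec. 15.2
for a function y at a point $x = x_0$ have the disadvantage that they
only employ one-sided values of the function for $x > x_0$. A
relatively higher accuracy is assured by symmetric formulas of
differentiation which take into account the values of the given function
y both for $x > x_0$ and for $x < x_0$. These formulas are ordinarily
called central formulas (formulas for central derivatives). We
confine ourselves to the derivation of one of them and take Stirling's
interpolation formula as a basis.
Let $\ldots, x_{-3}, x_{-2}, x_{-1}, x_0, x_1, x_2, x_3, \ldots$ be a set of equally
spaced points with $x_{i+1} - x_i = h$ the spacing and $y_i = f(x_i)$ the
corresponding values of the given function y = f(x). Setting

$$q = \frac{x - x_0}{h}$$

and replacing the function y approximately by the Stirling
interpolation polynomial, we get

$$\begin{aligned}
y(x) = y_0 &+ q\,\Delta y_{-\frac{1}{2}} + \frac{q^2}{2}\Delta^2 y_{-1} + \frac{q(q^2 - 1^2)}{3!}\Delta^3 y_{-\frac{3}{2}} + \frac{q^2(q^2 - 1^2)}{4!}\Delta^4 y_{-2} + \\
&+ \frac{q(q^2 - 1^2)(q^2 - 2^2)}{5!}\Delta^5 y_{-\frac{5}{2}} + \frac{q^2(q^2 - 1^2)(q^2 - 2^2)}{6!}\Delta^6 y_{-3} + \ldots
\end{aligned} \tag{1}$$

where for brevity we introduce the notation

$$\Delta y_{-\frac{1}{2}} = \frac{\Delta y_{-1} + \Delta y_0}{2}, \quad \Delta^3 y_{-\frac{3}{2}} = \frac{\Delta^3 y_{-2} + \Delta^3 y_{-1}}{2}, \quad \Delta^5 y_{-\frac{5}{2}} = \frac{\Delta^5 y_{-3} + \Delta^5 y_{-2}}{2}$$

and so forth.
From formula (1), noting that

$$\frac{dq}{dx} = \frac{1}{h}$$

we get

$$\begin{aligned}
y'(x) = \frac{1}{h}\Big(&\Delta y_{-\frac{1}{2}} + q\,\Delta^2 y_{-1} + \frac{3q^2 - 1}{6}\Delta^3 y_{-\frac{3}{2}} + \frac{2q^3 - q}{12}\Delta^4 y_{-2} + \\
&+ \frac{5q^4 - 15q^2 + 4}{120}\Delta^5 y_{-\frac{5}{2}} + \frac{3q^5 - 10q^3 + 4q}{360}\Delta^6 y_{-3} + \ldots\Big)
\end{aligned} \tag{2}$$

$$\begin{aligned}
y''(x) = \frac{1}{h^2}\Big(&\Delta^2 y_{-1} + q\,\Delta^3 y_{-\frac{3}{2}} + \frac{6q^2 - 1}{12}\Delta^4 y_{-2} + \\
&+ \frac{2q^3 - 3q}{12}\Delta^5 y_{-\frac{5}{2}} + \frac{15q^4 - 30q^2 + 4}{360}\Delta^6 y_{-3} + \ldots\Big)
\end{aligned} \tag{2'}$$

In particular, setting q = 0, we have

$$y'(x_0) = \frac{1}{h}\left(\Delta y_{-\frac{1}{2}} - \frac{1}{6}\Delta^3 y_{-\frac{3}{2}} + \frac{1}{30}\Delta^5 y_{-\frac{5}{2}} - \ldots\right) \tag{3}$$

and

$$y''(x_0) = \frac{1}{h^2}\left(\Delta^2 y_{-1} - \frac{1}{12}\Delta^4 y_{-2} + \frac{1}{90}\Delta^6 y_{-3} - \ldots\right) \tag{3'}$$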

Example 1. Find y’(1) and y"(1) for the function y=y(x) given 
‘by Table 63. 


TABLE 63 
VALUES OF THE FUNCTION y = y (x) 










0.7825361 


—86029 
0.98 | 0.7739332 —1326 
—87355 25 
1.00 0.7651977 ~~ 1301 L 
=88656 |. | 2% 
1.02 | 00.7563321 |. —1275 
—89931 







0.7473390 


Solution. Forming the differences for the function y (Table 63) 
and utilizing the underlined terms, we get, on the basis of for- 
mula (3), 


1 aes TEA 
1 2426 ..,, 1 WN 
4p. BEF 10-74 5-1 10 ja 


= —50. (88005.5 + 4.2-+ 0). 107? = —0.4400485 


As a check, note that the tabulated function is Bessel's function
of order zero, $y=J_0(x)$.
As we know,

$$J_0'(1)=-J_1(x)\big|_{x=1}=-0.4400506$$

Similarly, using the second and fourth differences in the row
$x_0=1.00$ and applying formula (3'), we have

$$y''(1)=\frac{1}{0.0004}\left[(-1301)\cdot10^{-7}-\frac1{12}\cdot1\cdot10^{-7}\right]
\approx-2500\cdot1301\cdot10^{-7}=-3.2525\cdot10^{-1}=-0.325250$$

By way of comparison, we give the exact value obtained on the
basis of relations between the Bessel functions:

$$y''(1)=J_0''(1)=J_1(1)-J_0(1)=0.4400506-0.7651977=-0.325147$$

Thus, finding the second derivative numerically is, generally, a
less reliable operation than finding the first derivative.
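
The computation in Example 1 can be retraced with a short script. This is a minimal sketch, assuming the five tabulated values of Table 63 with $h=0.02$; the helper name `diff` and the variable names are illustrative and not part of the text.

```python
# Central-difference derivatives at x0 = 1.00 from the values of Table 63.
h = 0.02
y = [0.7825361, 0.7739332, 0.7651977, 0.7563321, 0.7473390]  # y_{-2} .. y_{2}

def diff(values):
    """Forward differences of a list of values."""
    return [b - a for a, b in zip(values, values[1:])]

d1 = diff(y)    # Δy_{-2}, Δy_{-1}, Δy_0, Δy_1
d2 = diff(d1)   # Δ²y_{-2}, Δ²y_{-1}, Δ²y_0
d3 = diff(d2)   # Δ³y_{-2}, Δ³y_{-1}
d4 = diff(d3)   # Δ⁴y_{-2}

mean_d1 = (d1[1] + d1[2]) / 2      # Δy_{-1/2}
mean_d3 = (d3[0] + d3[1]) / 2      # Δ³y_{-3/2}

first = (mean_d1 - mean_d3 / 6) / h        # formula (3), two terms
second = (d2[1] - d4[0] / 12) / h**2       # formula (3'), two terms

print(first, second)
```

The first derivative agrees with $-J_1(1)$ to about six decimals, while the second derivative is correct only to about three, which is the loss of reliability noted above.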

Note. It is sometimes required to find the extremum of a
differentiable function $y=y(x)$ given in tabular form. At the point
of the extremum $\bar x$ the equation $y'(\bar x)=0$ must hold.
Equating the derivative $y'(x)$ to zero in formula (2), we find the
appropriate value $\bar q$ of $q$ by the method of successive
approximations. From this,

$$\bar x=x_0+\bar q\,h$$

and the value of $\bar y$ is computed from formula (1) or from some
other interpolation formula. The value $\bar y$ thus found is an
extremum of the function if the second difference $\Delta^2y$
preserves constant sign in the neighbourhood of the point $\bar x$.

Example 2. Find the zero of the derivative of the function
$y=J_1(x)$ given by Table 64.
We supplement Table 64 with finite differences of the function $y$.

Solution. We take $x_0=1.84$. Using the differences on and adjacent
to this row (in units of the seventh decimal place), we get, on the
basis of formula (2),

$$0=\frac{918-723}{2}-1641\,q+\frac{3q^2-1}{6}\cdot\frac{2+4}{2}$$

or

$$0=97-1641\,q+\frac32q^2$$

From this,

$$q=\frac{97}{1641}+\frac{1}{1094}q^2 \tag{4}$$


TABLE 64
THE VALUES OF THE FUNCTION y = J_1(x)
(differences in units of the seventh decimal place)

   x      y            Δy      Δ²y     Δ³y
  1.80   0.5815170
                       2561
  1.82   0.5817731             -1643
                        918               2
  1.84   0.5818649             -1641
                       -723               4
  1.86   0.5817926             -1637
                      -2360               2
  1.88   0.5815566             -1635
                      -3995
  1.90   0.5811571

Dropping the small nonlinear term, we get the first approximation: 


$$q^{(0)}=5.911\cdot10^{-2}$$

Improving this value, we obtain the second approximation from
formula (4):

$$q^{(1)}=q^{(0)}+\frac1{1094}\bigl[q^{(0)}\bigr]^2
=5.911\cdot10^{-2}+\frac1{1094}\cdot3.494\cdot10^{-3}
=5.911\cdot10^{-2}+3.2\cdot10^{-6}=5.911\cdot10^{-2}$$

Hence we can put

$$\bar q=0.05911$$

whence

$$\bar x=x_0+\bar q\,h=1.84+0.05911\cdot0.02=1.8411822$$

Thus

$$J_1'(1.8411822)=0$$
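
The successive approximations of Example 2 can be sketched as a fixed-point iteration on formula (4). The constants 97, 1641 and 1094 are those of the example; the tolerance and iteration cap are assumptions made here for illustration.

```python
# Fixed-point iteration q = 97/1641 + q**2/1094 (formula (4) of Example 2).
x0, h = 1.84, 0.02

q = 97 / 1641                     # first approximation: nonlinear term dropped
for _ in range(20):               # iteration cap chosen arbitrarily
    q_next = 97 / 1641 + q * q / 1094
    converged = abs(q_next - q) < 1e-12
    q = q_next
    if converged:
        break

x_bar = x0 + q * h                # abscissa of the zero of the derivative
print(q, x_bar)
```

Because the nonlinear term is of order $10^{-6}$, a single correction already fixes $q$ to the accuracy that the table supports.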


15.4 FORMULAS OF NUMERICAL DIFFERENTIATION 
FOR EQUALLY SPACED POINTS 
Let the points $x_0, x_1, x_2, \ldots, x_n$ be equally spaced; that is,

$$x_{i+1}-x_i=h\qquad(i=0,1,2,\ldots,n-1)$$

and for the function $y=y(x)$ let the values $y_i=y(x_i)$
$(i=0,1,\ldots,n)$ be known. For this set of points $x_i$ construct
Lagrange's interpolation polynomial (Sec. 14.12):

$$L_n(x)=\sum_{i=0}^{n}\frac{\Pi_{n+1}(x)}{(x-x_i)\,\Pi'_{n+1}(x_i)}\,y_i$$

where

$$\Pi_{n+1}(x)=(x-x_0)(x-x_1)\ldots(x-x_n)$$

Then

$$L_n(x_i)=y_i\qquad(i=0,1,\ldots,n)$$

Putting

$$q=\frac{x-x_0}{h}$$

we get

$$\Pi_{n+1}(x)=h^{n+1}q(q-1)\ldots(q-n)=h^{n+1}q^{[n+1]}$$

and

$$\Pi'_{n+1}(x_i)=(x_i-x_0)\ldots(x_i-x_{i-1})(x_i-x_{i+1})\ldots(x_i-x_n)
=h^n\,i(i-1)\ldots1\cdot(-1)\ldots[-(n-i)]
=(-1)^{n-i}h^n\,i!\,(n-i)! \tag{1}$$
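Hence, for the Lagrange polynomial $L_n(x)$ we have the expression

$$L_n(x)=\sum_{i=0}^{n}\frac{(-1)^{n-i}y_i}{i!\,(n-i)!}\cdot\frac{q^{[n+1]}}{q-i} \tag{2}$$

From this, noting that

$$\frac{dq}{dx}=\frac1h$$

we get

$$y'(x)\approx L_n'(x)=\frac1h\sum_{i=0}^{n}\frac{(-1)^{n-i}y_i}{i!\,(n-i)!}\,
\frac{d}{dq}\left[\frac{q^{[n+1]}}{q-i}\right] \tag{3}$$

The higher order derivatives of the given function $y(x)$ may be
found in similar fashion. To estimate the error

$$r_n(x)=y'(x)-L_n'(x)$$

we take advantage of the familiar formula for the error of
interpolation [formula (2), Sec. 14.14]:

$$R_n(x)=y(x)-L_n(x)=\frac{y^{(n+1)}(\xi)}{(n+1)!}\,\Pi_{n+1}(x) \tag{4}$$

where $\xi=\xi(x)$ is an intermediate value between the points
$x_0, x_1, \ldots, x_n$ and $x$.
Assuming that $y(x)\in C^{(n+2)}$, we derive

$$r_n(x)=R_n'(x)=\frac{1}{(n+1)!}\left\{y^{(n+1)}(\xi)\,\Pi'_{n+1}(x)
+\Pi_{n+1}(x)\,\frac{d}{dx}\bigl[y^{(n+1)}(\xi)\bigr]\right\}$$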


From this, taking into account formula (1) and assuming
$\frac{d}{dx}\bigl[y^{(n+1)}(\xi)\bigr]$ to be bounded, we get the
error of the derivative at the points $x_i$:

$$r_n(x_i)=(-1)^{n-i}h^n\,\frac{i!\,(n-i)!}{(n+1)!}\,y^{(n+1)}(\xi)$$

where $\xi$ is a value lying between $x_0$ and $x_n$.
I. Let us perform the computations for $n=2$ (three points).
From formula (2) we get

$$L_2(x)=\frac12y_0(q-1)(q-2)-y_1q(q-2)+\frac12y_2q(q-1)$$

whence, noting that $\frac{dq}{dx}=\frac1h$, we have

$$y'(x)\approx L_2'(x)=\frac1h\left[\frac12y_0(2q-3)-y_1(2q-2)+\frac12y_2(2q-1)\right]$$

In particular, for the derivatives $y'(x_i)=y_i'$ $(i=0,1,2)$
we obtain the following expressions:

$$y_0'=\frac1{2h}(-3y_0+4y_1-y_2),\qquad
y_1'=\frac1{2h}(-y_0+y_2),\qquad
y_2'=\frac1{2h}(y_0-4y_1+3y_2)$$

with corresponding errors:

$$r_0=\frac13h^2y'''(\xi_0),\qquad
r_1=-\frac16h^2y'''(\xi_1),\qquad
r_2=\frac13h^2y'''(\xi_2)$$
We give, without proof, the differentiation formulas for four
and five points [3], the validity of which the reader can easily
check.

II. $n=3$ (four points):

$$y_0'=\frac1{6h}(-11y_0+18y_1-9y_2+2y_3)-\frac{h^3}{4}y^{(4)}(\xi),$$
$$y_1'=\frac1{6h}(-2y_0-3y_1+6y_2-y_3)+\frac{h^3}{12}y^{(4)}(\xi),$$
$$y_2'=\frac1{6h}(y_0-6y_1+3y_2+2y_3)-\frac{h^3}{12}y^{(4)}(\xi),$$
$$y_3'=\frac1{6h}(-2y_0+9y_1-18y_2+11y_3)+\frac{h^3}{4}y^{(4)}(\xi)$$

III. $n=4$ (five points):

$$y_0'=\frac1{12h}(-25y_0+48y_1-36y_2+16y_3-3y_4)+\frac{h^4}{5}y^{(5)}(\xi),$$
$$y_1'=\frac1{12h}(-3y_0-10y_1+18y_2-6y_3+y_4)-\frac{h^4}{20}y^{(5)}(\xi),$$
$$y_2'=\frac1{12h}(y_0-8y_1+8y_3-y_4)+\frac{h^4}{30}y^{(5)}(\xi),$$
$$y_3'=\frac1{12h}(-y_0+6y_1-18y_2+10y_3+3y_4)-\frac{h^4}{20}y^{(5)}(\xi),$$
$$y_4'=\frac1{12h}(3y_0-16y_1+36y_2-48y_3+25y_4)+\frac{h^4}{5}y^{(5)}(\xi)$$

An examination of formulas I to III shows that if the number
of points is odd and the derivative is taken at the midpoint, then
the corresponding formula for numerical differentiation has a
simpler expression and higher accuracy.

Below we give the formulas (for $n=2$ and $n=4$) for such central
derivatives [3]; in order to exhibit the symmetry, we have changed
the numbering of the points (Fig. 66):

Fig. 66

I. $n=2$.

$$y_0'=\frac1{2h}(y_1-y_{-1})-\frac{h^2}{6}y'''(\xi)$$

where $y_i=y(x_i)$ and $i=-1, 0, 1$.

II. $n=4$.

$$y_0'=\frac1{12h}(y_{-2}-8y_{-1}+8y_1-y_2)+\frac{h^4}{30}y^{(5)}(\xi)$$

where $y_i=y(x_i)$ and $i=-2, -1, 0, 1, 2$.
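
The three-point formulas of case I and the central five-point formula can be tried on a function with a known derivative. The following is a small sketch; the test function $e^x$, the spacing $h=0.1$ and the function names are assumptions chosen only for illustration.

```python
import math

def three_point(y0, y1, y2, h):
    """Derivatives at x0, x1, x2 from three equally spaced values (n = 2)."""
    return ((-3 * y0 + 4 * y1 - y2) / (2 * h),
            (y2 - y0) / (2 * h),
            (y0 - 4 * y1 + 3 * y2) / (2 * h))

def central_five_point(ym2, ym1, y1, y2, h):
    """Derivative at the midpoint x0 from five equally spaced values (n = 4)."""
    return (ym2 - 8 * ym1 + 8 * y1 - y2) / (12 * h)

h = 0.1
f = math.exp                      # illustrative test function; f' = f
d0, d1, d2 = three_point(f(0.0), f(h), f(2 * h), h)
dc = central_five_point(f(-2 * h), f(-h), f(h), f(2 * h), h)

print(d0 - f(0.0), d1 - f(h), dc - 1.0)
```

The observed errors stay within the bounds $\frac13h^2\max|y'''|$ and $\frac16h^2\max|y'''|$ for the end point and the midpoint, and the five-point central formula is accurate to a few units of $10^{-6}$, in line with its remainder $\frac{h^4}{30}y^{(5)}(\xi)$.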


15.5 GRAPHICAL DIFFERENTIATION 


The problem of graphical differentiation consists in constructing 
the graph of the derivative 


$$y'=f'(x)$$

of a function $y=f(x)$ on the basis of the graph of that function.

We start with the graph of the function $y=f(x)$ (Fig. 67).
To construct (to a known scale $l$) the graph of its derivative,
choose on the given curve a sufficiently dense set of points 1, 2, 3,
4, 5, ... including, if possible, the characteristic points of the
graph. "By eye" construct tangents to the graph of the function
at these points. Then, on the x-axis choose the point $P(-l, 0)$
(pole) and draw straight lines $P1', P2', P3', P4', P5', \ldots$
parallel to the corresponding tangents to their intersection with the
y-axis. The segments of the y-axis $O1', O2', O3', O4', O5', \ldots$
are, respectively, magnitudes proportional to the values of the
derivative $y'=f'(x)$ at the chosen points; that is to say, they are
ordinates of the graph of the derivative. Indeed, to take an example,
for point 1 in Fig. 67 we have

$$O1'=l\tan\alpha_1=l\,f'(x_1)$$

Fig. 67

We obtain similar results for all other points. Therefore, the points
$1'', 2'', 3'', 4'', 5'', \ldots$ of intersection of the parallels
passing through the points $1', 2', 3', 4', 5', \ldots$ with the
corresponding vertical lines passing through the points of tangency
1, 2, 3, 4, 5, ... belong to the graph of the derivative $y=l\,f'(x)$.

By joining the points $1'', 2'', 3'', 4'', 5'', \ldots$ with a line
the nature of which has regard for the positions of the intermediate
points, we obtain an approximate graph of the derivative $y'$ to the
scale $l$. If we choose $l=1$, the graph of the derivative is then
drawn to full scale.



To increase the accuracy of the graphical construction, it is
advisable first to determine the direction of the tangent and then
to give an indication of the point of tangency. To do this,
subdivide the graph of the function into small segments that differ
but slightly from straight-line segments. Let us consider one of
them, AB in Fig. 68. Construct a family of chords parallel to the
secant AB. The locus of the midpoints of these chords is the curve
K intersecting the graph of the function at C, the tangent at C
being parallel to the secant AB. Using this technique, we can find
the point and the corresponding direction of the tangent on each
segment. The subsequent construction is carried out as indicated
above.

Fig. 68

More detailed descriptions may be found in the special literature
(see, for example, [5]).


15.6 ON THE APPROXIMATE CALCULATION
OF PARTIAL DERIVATIVES


If a function $z=f(x,y)$ is given on a rectangular grid

$$x_i=x_0+ih,\qquad y_j=y_0+jk$$

$(i, j=0, 1, 2, \ldots)$, then it may be represented in approximate
fashion by the interpolation formula (Sec. 14.26)

$$z=z_{00}+\bigl[p\,\Delta^{1+0}z_{00}+q\,\Delta^{0+1}z_{00}\bigr]
+\frac1{2!}\bigl[p(p-1)\Delta^{2+0}z_{00}+2pq\,\Delta^{1+1}z_{00}+q(q-1)\Delta^{0+2}z_{00}\bigr]$$
$$\quad+\frac1{3!}\bigl[p(p-1)(p-2)\Delta^{3+0}z_{00}+3p(p-1)q\,\Delta^{2+1}z_{00}
+3pq(q-1)\Delta^{1+2}z_{00}+q(q-1)(q-2)\Delta^{0+3}z_{00}\bigr] \tag{1}$$

where

$$p=\frac{x-x_0}{h},\qquad q=\frac{y-y_0}{k}$$

and $\Delta^{m+n}z_{00}=\Delta^{m+n}_{x^my^n}z(0,0)$ are mixed double
differences.

From formula (1) it is easy to find the partial derivatives

$$\frac{\partial z}{\partial x}=\frac{\partial z}{\partial p}\frac{dp}{dx}=\frac1h\frac{\partial z}{\partial p},\qquad
\frac{\partial z}{\partial y}=\frac{\partial z}{\partial q}\frac{dq}{dy}=\frac1k\frac{\partial z}{\partial q}$$

and so forth.
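
As an illustration of how formula (1) is used, one may differentiate it term by term and then set $p=q=0$, which gives the first partial derivatives at the grid point $(x_0, y_0)$. The expressions below are a sketch truncated at the third differences retained in (1); they are worked out here from (1) and are not quoted from the text.

```latex
\left.\frac{\partial z}{\partial x}\right|_{(x_0,y_0)} \approx
\frac{1}{h}\left[\Delta^{1+0}z_{00}-\frac{1}{2}\Delta^{2+0}z_{00}+\frac{1}{3}\Delta^{3+0}z_{00}\right],
\qquad
\left.\frac{\partial z}{\partial y}\right|_{(x_0,y_0)} \approx
\frac{1}{k}\left[\Delta^{0+1}z_{00}-\frac{1}{2}\Delta^{0+2}z_{00}+\frac{1}{3}\Delta^{0+3}z_{00}\right]
```

The mixed differences drop out at $p=q=0$ because each of their coefficients in the differentiated formula still contains a factor $q$ (respectively $p$).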


REFERENCES FOR CHAPTER 15 


[1] A. N. Krylov, Lectures on Approximate Computations,
6th edition, 1954, p. 228 (in Russian).
[2] James B. Scarborough, Numerical Mathematical
Analysis, 1955, Chapter VII.
[3] William E. Milne, Numerical Calculus, 1949, Chapter IV.
[4] Sh. E. Mikeladze, Numerical Methods of
Mathematical Analysis, 1953, Chapter XII (in Russian).
[5] Carl Runge, Graphische Methoden, 1919, II. Kapitel, § 14.


Chapter 16 
APPROXIMATE INTEGRATION OF FUNCTIONS 


16.1 GENERAL REMARKS 


If a function $f(x)$ is continuous on an interval $[a, b]$ and its
antiderivative $F(x)$ is known, then the definite integral of this
function from $a$ to $b$ may be computed from the Newton-Leibniz
formula

$$\int_a^b f(x)\,dx=F(b)-F(a) \tag{1}$$

where $F'(x)=f(x)$.

However, in many cases the antiderivative F (x) cannot be found 
by elementary means or is too involved; as a result, computation 
of the definite integral by formula (1) may be difficult or practi- 
cally impossible. 

Moreover, in practical situations, the integrand $f(x)$ is often
specified in tabular form and then the whole concept of an
antiderivative is meaningless. Similar problems arise in the
computation of multiple integrals. Hence the great importance
of approximate, primarily numerical, methods for computing
definite integrals.

The problem of the numerical integration of a function con- 
sists in computing the value of a definite integral on the basis of 
a series of values of the integrand. 

The numerical computation of a single integral is called me- 
chanical quadrature, that of a double integral, mechanical cubature. 
We will call the respective formulas, quadrature and cubature 
formulas. 

Let us first examine the numerical computation of single
integrals. The ordinary technique of mechanical quadrature consists in
replacing the given function $f(x)$ on the interval $[a, b]$ under
consideration by an interpolating or approximating function
$\varphi(x)$ of a simple kind (say a polynomial) and approximately
setting

$$\int_a^b f(x)\,dx=\int_a^b\varphi(x)\,dx \tag{2}$$

The function $\varphi(x)$ must be such that the integral
$\int_a^b\varphi(x)\,dx$ can be evaluated directly.

If the function $f(x)$ is given analytically, then the question is
posed of estimating the error in formula (2).

Let us examine in more detail the use, for this purpose, of the
Lagrange interpolation polynomial (Sec. 14.12).

Suppose, for the function $y=f(x)$, we know the corresponding
values at $n+1$ points $x_0, x_1, x_2, \ldots, x_n$ of $[a, b]$:

$$f(x_i)=y_i\qquad(i=0,1,2,\ldots,n) \tag{3}$$

It is required to find approximately

$$\int_a^b y\,dx=\int_a^b f(x)\,dx$$

Using the given values $y_i$, construct the Lagrange polynomial

$$L_n(x)=\sum_{i=0}^{n}\frac{\Pi_{n+1}(x)}{(x-x_i)\,\Pi'_{n+1}(x_i)}\,y_i \tag{4}$$

where

$$\Pi_{n+1}(x)=(x-x_0)(x-x_1)\ldots(x-x_n)$$

and

$$L_n(x_i)=y_i\qquad(i=0,1,2,\ldots,n)$$
Replacing the function $f(x)$ by the polynomial $L_n(x)$, we get

$$\int_a^b f(x)\,dx=\int_a^b L_n(x)\,dx+R_n[f] \tag{5}$$

where $R_n[f]$ is the error in the quadrature formula (5) (remainder
term). From this, using (4), we get the approximate quadrature
formula

$$\int_a^b y\,dx=\sum_{i=0}^{n}A_iy_i \tag{6}$$

where

$$A_i=\int_a^b\frac{\Pi_{n+1}(x)}{(x-x_i)\,\Pi'_{n+1}(x_i)}\,dx\qquad(i=0,1,2,\ldots,n) \tag{7}$$


If the limits of integration $a$ and $b$ are interpolation points,
then the quadrature formula (6) is of the "closed type", otherwise
it is of the "open type".

With respect to computation of the coefficients $A_i$, note that

(1) the coefficients $A_i$ are independent of the choice of the
function $f(x)$ for a given arrangement of the points;

(2) for a polynomial of degree $n$, formula (6) is exact because
in that case $L_n(x)\equiv f(x)$; hence, in particular, formula (6) is
exact for $y=x^k$ $(k=0, 1, \ldots, n)$, that is, $R_n[x^k]=0$ for
$k=0, 1, \ldots, n$.

Putting $y=x^k$ $(k=0, 1, 2, \ldots, n)$ in (6), we get a linear
system of $n+1$ equations:

$$\sum_{i=0}^{n}A_i=I_0,\qquad
\sum_{i=0}^{n}A_ix_i=I_1,\qquad\ldots,\qquad
\sum_{i=0}^{n}A_ix_i^n=I_n \tag{8}$$

where

$$I_k=\int_a^b x^k\,dx=\frac{b^{k+1}-a^{k+1}}{k+1}\qquad(k=0,1,\ldots,n)$$

from which it is possible to determine the coefficients
$A_0, A_1, \ldots, A_n$ [1], [2]. The determinant of system (8) is the
Vandermonde determinant

$$D=\prod_{i>j}(x_i-x_j)\ne0$$


Note that in the application of this method the actual construc- 
tion of the Lagrange polynomial L, (x) is unnecessary. 

A simple method for computing the errors of quadrature for- 
mulas has been worked out by S. M. Nikolsky [3]. 


Example. Derive a quadrature formula of the form

$$\int_0^1 y\,dx=A_0\,y\!\left(\tfrac14\right)+A_1\,y\!\left(\tfrac12\right)+A_2\,y\!\left(\tfrac34\right) \tag{9}$$

Solution. Putting

$$y=x^k\qquad(k=0,1,2)$$

in (9) and noting that

$$\int_0^1dx=1,\qquad\int_0^1x\,dx=\frac12,\qquad\int_0^1x^2\,dx=\frac13$$

we get the system

$$1=A_0+A_1+A_2,$$
$$\frac12=\frac14A_0+\frac12A_1+\frac34A_2,$$
$$\frac13=\frac1{16}A_0+\frac14A_1+\frac9{16}A_2$$

whence

$$A_0=\frac23,\qquad A_1=-\frac13,\qquad A_2=\frac23$$

and, thus,

$$\int_0^1 y\,dx\approx\frac23\,y\!\left(\tfrac14\right)-\frac13\,y\!\left(\tfrac12\right)+\frac23\,y\!\left(\tfrac34\right) \tag{10}$$

The quadrature formula (10) is of the open type and is exact
for all polynomials of degree not exceeding two. It is easy to see
that (10) yields a correct result for $y=x^3$ as well, and so this
formula is also exact for third-degree polynomials.
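
The method of system (8) lends itself to a direct computation. Below is a minimal sketch in exact rational arithmetic that reproduces the coefficients of formula (10); the function name `quadrature_weights` and the use of Gauss-Jordan elimination are choices made here for illustration, not something prescribed by the text.

```python
from fractions import Fraction

def quadrature_weights(nodes, a, b):
    """Solve sum_i A_i * x_i**k = (b**(k+1) - a**(k+1)) / (k+1), k = 0..m-1,
    where m = len(nodes) (this is n + 1 in the notation of system (8))."""
    m = len(nodes)
    a, b = Fraction(a), Fraction(b)
    # Augmented matrix of the Vandermonde system.
    rows = [[Fraction(x) ** k for x in nodes] + [(b ** (k + 1) - a ** (k + 1)) / (k + 1)]
            for k in range(m)]
    # Gauss-Jordan elimination; a nonzero pivot exists because the nodes are distinct.
    for col in range(m):
        pivot = next(r for r in range(col, m) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        rows[col] = [v / rows[col][col] for v in rows[col]]
        for r in range(m):
            if r != col:
                factor = rows[r][col]
                rows[r] = [v - factor * p for v, p in zip(rows[r], rows[col])]
    return [rows[i][m] for i in range(m)]

nodes = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
A = quadrature_weights(nodes, 0, 1)
cubic = sum(w * x ** 3 for w, x in zip(A, nodes))   # formula (10) applied to y = x**3

print(A, cubic)
```

The last line checks the remark above: applied to $y=x^3$ the formula returns $\frac14$, the exact value of the integral.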
16.2 NEWTON-COTES QUADRATURE FORMULAS 
Suppose, for a given function $y=f(x)$, it is required to compute
the integral

$$\int_a^b y\,dx$$

Choosing a spacing

$$h=\frac{b-a}{n}$$

divide the interval $[a, b]$ by means of equally spaced points

$$x_0=a,\qquad x_i=x_0+ih\quad(i=1,2,\ldots,n-1),\qquad x_n=b$$

into $n$ equal parts, and let

$$y_i=f(x_i)\qquad(i=0,1,2,\ldots,n)$$

Replacing the function $y$ by an appropriate Lagrange
interpolation polynomial $L_n(x)$, we obtain the approximate
quadrature formula

$$\int_{x_0}^{x_n}y\,dx=\sum_{i=0}^{n}A_iy_i \tag{1}$$

where $A_i$ are certain constant coefficients.


We derive explicit expressions for the coefficients $A_i$ of
formula (1). As is known (Sec. 14.12),

$$L_n(x)=\sum_{i=0}^{n}p_i(x)\,y_i \tag{2}$$

where

$$p_i(x)=\frac{(x-x_0)\ldots(x-x_{i-1})(x-x_{i+1})\ldots(x-x_n)}
{(x_i-x_0)\ldots(x_i-x_{i-1})(x_i-x_{i+1})\ldots(x_i-x_n)} \tag{3}$$

Introducing the notation

$$q=\frac{x-x_0}{h} \tag{4}$$

and

$$q^{[n+1]}=q(q-1)\ldots(q-n) \tag{5}$$

we obtain [cf. formula (2) of Sec. 15.4]

$$L_n(x)=\sum_{i=0}^{n}\frac{(-1)^{n-i}}{i!\,(n-i)!}\cdot\frac{q^{[n+1]}}{q-i}\,y_i \tag{6}$$

Replacing in (1) the function $y$ by the polynomial $L_n(x)$, we
obtain, by virtue of (6),

$$A_i=\int_{x_0}^{x_n}\frac{(-1)^{n-i}}{i!\,(n-i)!}\cdot\frac{q^{[n+1]}}{q-i}\,dx$$

or, since

$$q=\frac{x-x_0}{h},\qquad dq=\frac{dx}{h}$$

then, by a change of variables in the definite integral, we get

$$A_i=h\,\frac{(-1)^{n-i}}{i!\,(n-i)!}\int_0^n\frac{q^{[n+1]}}{q-i}\,dq\qquad(i=0,1,\ldots,n)$$

Since

$$h=\frac{b-a}{n}$$

we ordinarily put

$$A_i=(b-a)H_i$$

where

$$H_i=\frac1n\cdot\frac{(-1)^{n-i}}{i!\,(n-i)!}\int_0^n\frac{q^{[n+1]}}{q-i}\,dq\qquad(i=0,1,\ldots,n) \tag{7}$$

are constants called Cotes coefficients (see, for example, [1], [4]).


Then the quadrature formula (1) assumes the form

$$\int_a^b y\,dx=(b-a)\sum_{i=0}^{n}H_iy_i \tag{8}$$

where

$$h=\frac{b-a}{n}$$

and $y_i=f(a+ih)$ $(i=0, 1, \ldots, n)$.
It is easy to see that the following relations are valid:

$$(1)\ \sum_{i=0}^{n}H_i=1,\qquad(2)\ H_i=H_{n-i}$$
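
Formula (7) can be evaluated exactly by expanding the product $\prod_{j\ne i}(q-j)$ and integrating it term by term. The sketch below does this in rational arithmetic; the function name `cotes_coefficients` and the coefficient-list representation of the polynomial are assumptions made for this illustration.

```python
from fractions import Fraction
from math import factorial

def cotes_coefficients(n):
    """Cotes coefficients H_0..H_n from formula (7), computed exactly."""
    H = []
    for i in range(n + 1):
        # Coefficients (lowest degree first) of prod_{j != i} (q - j) = q^[n+1] / (q - i).
        poly = [Fraction(1)]
        for j in range(n + 1):
            if j == i:
                continue
            shifted = [Fraction(0)] + poly                    # q * poly
            scaled = [-j * c for c in poly] + [Fraction(0)]   # -j * poly
            poly = [s + t for s, t in zip(shifted, scaled)]
        # Integrate term by term from 0 to n.
        integral = sum(c * Fraction(n) ** (k + 1) / (k + 1) for k, c in enumerate(poly))
        sign = (-1) ** (n - i)
        H.append(sign * integral / (n * factorial(i) * factorial(n - i)))
    return H

H2 = cotes_coefficients(2)
H6 = cotes_coefficients(6)
H8 = cotes_coefficients(8)
print(H2, [h * 840 for h in H6])
```

Both relations stated above can be checked on the output: the coefficients sum to 1 and are symmetric, $H_i=H_{n-i}$.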


16.3 THE TRAPEZOIDAL FORMULA AND ITS REMAINDER TERM 


Applying formula (7) of the preceding section, we have, for
$n=1$,

$$H_0=-\int_0^1\frac{q(q-1)}{q}\,dq=\frac12,\qquad
H_1=\int_0^1q\,dq=\frac12$$

whence

$$\int_{x_0}^{x_1}y\,dx=\frac h2(y_0+y_1) \tag{1}$$

Fig. 69

We thus obtain the trapezoidal formula for approximate
computation of a definite integral (Fig. 69).
The remainder term (error) of the quadrature formula (1) is

$$R=\int_{x_0}^{x_1}y\,dx-\frac h2(y_0+y_1)$$

Assuming that $y\in C^{(2)}[a, b]$, we derive a simple formula for the
remainder term. We will regard $R=R(h)$ as a function of the
spacing $h$; then we can put

$$R(h)=\int_{x_0}^{x_0+h}y\,dx-\frac h2\bigl[y(x_0)+y(x_0+h)\bigr]$$

Differentiating this formula with respect to $h$ twice in succession,
we obtain

$$R'(h)=y(x_0+h)-\frac12\bigl[y(x_0)+y(x_0+h)\bigr]-\frac h2y'(x_0+h)
=\frac12\bigl[y(x_0+h)-y(x_0)\bigr]-\frac h2y'(x_0+h)$$

and

$$R''(h)=\frac12y'(x_0+h)-\frac12y'(x_0+h)-\frac h2y''(x_0+h)
=-\frac h2y''(x_0+h)$$

Note that

$$R(0)=0,\qquad R'(0)=0$$

From this, integrating with respect to $h$ and using the
mean-value theorem, we successively derive

$$R'(h)=R'(0)+\int_0^hR''(t)\,dt=-\frac12\int_0^ht\,y''(x_0+t)\,dt
=-\frac12y''(\xi_1)\int_0^ht\,dt=-\frac{h^2}{4}y''(\xi_1)$$

where $\xi_1\in(x_0, x_0+h)$ and

$$R(h)=R(0)+\int_0^hR'(t)\,dt=-\frac14\int_0^ht^2y''(\xi_1)\,dt
=-\frac14y''(\xi)\int_0^ht^2\,dt=-\frac{h^3}{12}y''(\xi)$$

where $\xi\in(x_0, x_0+h)$.

Thus, we finally have

$$R=-\frac{h^3}{12}y''(\xi) \tag{2}$$

where $\xi\in(x_0, x_1)$.

From the foregoing, it follows, in particular, that if $y''>0$,
then formula (1) yields the value of the integral with an excess,
but if $y''<0$, it yields the value with a deficit.


16.4 SIMPSON’S FORMULA AND ITS REMAINDER TERM 
From formula (7) of Sec. 16.2 we get, for $n=2$,

$$H_0=\frac12\cdot\frac12\int_0^2(q-1)(q-2)\,dq=\frac14\left(\frac83-6+4\right)=\frac16,$$
$$H_1=-\frac12\cdot\frac11\int_0^2q(q-2)\,dq=\frac23,$$
$$H_2=\frac12\cdot\frac12\int_0^2q(q-1)\,dq=\frac16$$

Hence, since $x_2-x_0=2h$, we have

$$\int_{x_0}^{x_2}y\,dx=\frac h3(y_0+4y_1+y_2) \tag{1}$$

which is the formula of Simpson's rule. Geometrically, this
formula is obtained by replacing the given curve $y=f(x)$ by the
parabola $y=L_2(x)$ passing through three points $M_0(x_0, y_0)$,
$M_1(x_1, y_1)$ and $M_2(x_2, y_2)$ (Fig. 70).

Fig. 70

The remainder term of Simpson's formula is

$$R=\int_{x_0}^{x_2}y\,dx-\frac h3(y_0+4y_1+y_2)$$

Assuming that $y\in C^{(4)}[a, b]$, we derive a simpler expression for
$R$ in a manner similar to what was done for the trapezoidal formula.
Fixing the midpoint $x_1$ and regarding $R=R(h)$ as a function of
the spacing $h$ $(h\ge0)$, we get

$$R(h)=\int_{x_1-h}^{x_1+h}y\,dx-\frac h3\bigl[y(x_1-h)+4y(x_1)+y(x_1+h)\bigr]$$

whence, differentiating the function $R(h)$ three times in succession
with respect to $h$, we obtain

$$R'(h)=\bigl[y(x_1+h)+y(x_1-h)\bigr]-\frac13\bigl[y(x_1-h)+4y(x_1)+y(x_1+h)\bigr]
-\frac h3\bigl[-y'(x_1-h)+y'(x_1+h)\bigr]$$
$$\quad=\frac23\bigl[y(x_1-h)+y(x_1+h)\bigr]-\frac43y(x_1)
-\frac h3\bigl[-y'(x_1-h)+y'(x_1+h)\bigr],$$

$$R''(h)=\frac23\bigl[-y'(x_1-h)+y'(x_1+h)\bigr]-\frac13\bigl[-y'(x_1-h)+y'(x_1+h)\bigr]
-\frac h3\bigl[y''(x_1-h)+y''(x_1+h)\bigr]$$
$$\quad=\frac13\bigl[-y'(x_1-h)+y'(x_1+h)\bigr]-\frac h3\bigl[y''(x_1-h)+y''(x_1+h)\bigr],$$

$$R'''(h)=\frac13\bigl[y''(x_1-h)+y''(x_1+h)\bigr]-\frac13\bigl[y''(x_1-h)+y''(x_1+h)\bigr]
-\frac h3\bigl[-y'''(x_1-h)+y'''(x_1+h)\bigr]$$
$$\quad=-\frac h3\bigl[y'''(x_1+h)-y'''(x_1-h)\bigr]=-\frac{2h^2}{3}y^{(4)}(\xi_3)$$

where $\xi_3\in(x_1-h, x_1+h)$.
Moreover, we have

$$R(0)=0,\qquad R'(0)=0,\qquad R''(0)=0$$

Successively integrating $R'''(h)$ and using the mean-value theorem,
we find

$$R''(h)=R''(0)+\int_0^hR'''(t)\,dt=-\frac23\int_0^ht^2y^{(4)}(\xi_3)\,dt
=-\frac23y^{(4)}(\xi_2)\int_0^ht^2\,dt=-\frac29h^3y^{(4)}(\xi_2)$$

where $\xi_2\in(x_1-h, x_1+h)$,

$$R'(h)=R'(0)+\int_0^hR''(t)\,dt=-\frac29\int_0^ht^3y^{(4)}(\xi_2)\,dt
=-\frac29y^{(4)}(\xi_1)\int_0^ht^3\,dt=-\frac1{18}h^4y^{(4)}(\xi_1)$$

where $\xi_1\in(x_1-h, x_1+h)$,

$$R(h)=R(0)+\int_0^hR'(t)\,dt=-\frac1{18}\int_0^ht^4y^{(4)}(\xi_1)\,dt
=-\frac1{18}y^{(4)}(\xi)\int_0^ht^4\,dt=-\frac{h^5}{90}y^{(4)}(\xi)$$

where $\xi\in(x_1-h, x_1+h)$.
Thus, the remainder term of Simpson's formula is

$$R=-\frac{h^5}{90}y^{(4)}(\xi) \tag{2}$$

where $\xi\in(x_0, x_2)$.

Hence, this formula is exact for polynomials not only of degree
two but of degree three as well, which means that Simpson's rule
has a rather high accuracy for a relatively small number of
ordinates.


16.5 NEWTON-COTES FORMULAS OF HIGHER ORDERS 


Carrying out the appropriate computations for $n=3$, we obtain
from formula (7) of Sec. 16.2 Newton's quadrature formula

$$\int_{x_0}^{x_3}y\,dx=\frac{3h}{8}(y_0+3y_1+3y_2+y_3) \tag{1}$$

(three-eighths rule).

The remainder term of formula (1) is [2]

$$R=-\frac{3h^5}{80}y^{(4)}(\xi) \tag{2}$$

where $\xi\in(x_0, x_3)$.
The subsequent Newton-Cotes quadrature formulas are given
in [1], [2]. The remainder terms of these formulas are given by
Steffensen (see [1], [5], [6]).

Note that for sufficient smoothness of the function $y=f(x)$ the
error in a Newton-Cotes formula employing $n+1$ ordinates is at
least of the order of [1], [6]

$$h^{2E\left(\frac n2\right)+3}$$

where $E\left(\frac n2\right)$ is the largest integer in the fraction $\frac n2$.


From this we see that quadrature formulas employing an odd
number of ordinates are more advantageous as to the degree of
accuracy. See the table of Cotes coefficients (Table 65). For the
sake of notational convenience, the Cotes coefficients for each $n$
are given in the form of fractions:

$$H_i=\frac{\hat H_i}{N}$$

with common denominator $N$. Note as a check that

$$\sum_{i=0}^{n}\hat H_i=N$$

TABLE 65
COTES COEFFICIENTS

 n   Ĥ_0    Ĥ_1    Ĥ_2    Ĥ_3     Ĥ_4    Ĥ_5     Ĥ_6    Ĥ_7    Ĥ_8    Common denominator N
 1     1      1                                                                  2
 2     1      4      1                                                           6
 3     1      3      3      1                                                    8
 4     7     32     12     32       7                                           90
 5    19     75     50     50      75     19                                   288
 6    41    216     27    272      27    216      41                           840
 7   751   3577   1323   2989    2989   1323    3577    751                  17280
 8   989   5888   -928  10496   -4540  10496    -928   5888    989           28350

It is worth bearing in mind that the Cotes coefficients for large $n$
may be negative (see, for instance, $n=8$).


Example. Evaluate

$$I=\int_0^1\frac{dx}{1+x}$$

using a Newton-Cotes formula employing seven ordinates ($n=6$).
Solution. Taking a spacing of

$$h=\frac{1-0}{6}=\frac16$$

we form a table of values (see Table 66) where for convenience
we put $\hat H_i=840\,H_i$.

TABLE 66
EVALUATING AN INTEGRAL BY THE NEWTON-COTES FORMULA

 i    x_i    y_i     Ĥ_i    Ĥ_i y_i
 0    0      1        41     41
 1    1/6    6/7     216    185.142857
 2    1/3    3/4      27     20.25
 3    1/2    2/3     272    181.333333
 4    2/3    3/5      27     16.2
 5    5/6    6/11    216    117.818182
 6    1      1/2      41     20.5
                      Σ     582.244372

Whence

$$I=\frac1{840}\cdot582.244372=0.693148$$

The exact value is

$$I=\ln2=0.69315\ldots$$


Since the Cotes coefficients are extremely involved for a large 
number of ordinates, a practical procedure for approximating de- 
finite integrals is as follows: the interval of integration is divided 
into a sufficiently large number of subintervals to each of which 
is applied a Newton-Cotes quadrature formula employing a small 
number of ordinates (see, for instance, [7]). Formulas of simpler 
structure are then obtained and the accuracy of these formulas 
may be arbitrarily high. 

We will now consider some examples of such formulas. 


16.6 GENERAL TRAPEZOIDAL FORMULA (TRAPEZOIDAL RULE)


To evaluate the integral

$$\int_a^b y\,dx$$

divide the interval of integration $[a, b]$ into $n$ equal parts
$[x_0, x_1], [x_1, x_2], \ldots, [x_{n-1}, x_n]$ and to each apply the
trapezoidal rule [see Sec. 16.3, formula (1)]. Setting
$h=\frac{b-a}{n}$ and denoting by $y_i=f(x_i)$ $(i=0, 1, \ldots, n)$
the values of the integrand at the points $x_i$, we have

$$\int_a^b y\,dx=\frac h2(y_0+y_1)+\frac h2(y_1+y_2)+\ldots+\frac h2(y_{n-1}+y_n)$$

or

$$\int_a^b y\,dx=h\left(\frac{y_0}{2}+y_1+y_2+\ldots+y_{n-1}+\frac{y_n}{2}\right) \tag{1}$$

Geometrically, formula (1) is obtained by replacing the graph
of the integrand function $y=f(x)$ by a polygonal line (Fig. 71).

Fig. 71

If $y\in C^{(2)}[a, b]$, then the remainder term of the quadrature
formula (1) is, by (2) of Sec. 16.3,

$$R=\int_{x_0}^{x_n}y\,dx-\frac h2\sum_{i=1}^{n}(y_{i-1}+y_i)
=\sum_{i=1}^{n}\left[\int_{x_{i-1}}^{x_i}y\,dx-\frac h2(y_{i-1}+y_i)\right]
=-\frac{h^3}{12}\sum_{i=1}^{n}y''(\xi_i) \tag{2}$$

where $\xi_i\in(x_{i-1}, x_i)$.
Consider the arithmetic mean

$$\mu=\frac1n\sum_{i=1}^{n}y''(\xi_i) \tag{3}$$

Clearly, $\mu$ lies between the smallest value $m_2$ and the largest
value $M_2$ of the second derivative $y''$ on the interval $[a, b]$.
Thus

$$m_2\le\mu\le M_2$$

Since $y''$ is continuous on $[a, b]$, it assumes all intermediate
values between $m_2$ and $M_2$. A point $\xi\in[a, b]$ therefore exists
such that

$$\mu=y''(\xi)$$

From formulas (2) and (3) we have

$$R=-\frac{nh^3}{12}y''(\xi)=-\frac{(b-a)h^2}{12}y''(\xi) \tag{4}$$

where $\xi\in[a, b]$.
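
The general trapezoidal formula (1) and the bound that follows from the remainder term (4) can be sketched as follows. The integrand $1/(1+x)$ on $[0, 1]$, the value $n=10$ and the function name `trapezoid` are assumptions chosen here for illustration.

```python
import math

def trapezoid(f, a, b, n):
    """General trapezoidal formula (1) with n equal subintervals."""
    h = (b - a) / n
    inner = sum(f(a + i * h) for i in range(1, n))
    return h * (f(a) / 2 + inner + f(b) / 2)

f = lambda x: 1 / (1 + x)        # illustrative integrand; exact integral is ln 2
a, b, n = 0.0, 1.0, 10
h = (b - a) / n

approx = trapezoid(f, a, b, n)
M2 = 2.0                         # max of |y''| = 2/(1+x)**3 on [0, 1]
bound = (b - a) * h**2 / 12 * M2 # |R| <= (b-a) h^2 M2 / 12, from formula (4)
excess = approx - math.log(2)    # equals -R

print(approx, excess, bound)
```

Since $y''>0$ on the interval, the computed value exceeds the integral, as the sign of the remainder term predicts.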


16.7 SIMPSON'S GENERAL FORMULA (PARABOLIC RULE)


Let $n=2m$ be an even number and let $y_i=f(x_i)$ $(i=0, 1, 2, \ldots, n)$
be the values of the function $y=f(x)$ for equally spaced points
$a=x_0, x_1, \ldots, x_n=b$ with spacing

$$h=\frac{b-a}{n}=\frac{b-a}{2m}$$

Applying Simpson's rule [Sec. 16.4, formula (1)] to each doubled
interval $[x_0, x_2], [x_2, x_4], \ldots, [x_{2m-2}, x_{2m}]$ of length
$2h$ (Fig. 72), we get

Fig. 72

$$\int_a^b y\,dx=\frac h3(y_0+4y_1+y_2)+\frac h3(y_2+4y_3+y_4)+\ldots
+\frac h3(y_{2m-2}+4y_{2m-1}+y_{2m})$$

whence we obtain Simpson's general formula:

$$\int_a^b y\,dx=\frac h3\bigl[(y_0+y_{2m})+4(y_1+y_3+\ldots+y_{2m-1})
+2(y_2+y_4+\ldots+y_{2m-2})\bigr] \tag{1}$$
Introducing the notation

$$\sigma_1=y_1+y_3+\ldots+y_{2m-1},\qquad
\sigma_2=y_2+y_4+\ldots+y_{2m-2}$$

we can write (1) more simply as

$$\int_a^b y\,dx=\frac h3\bigl[(y_0+y_n)+4\sigma_1+2\sigma_2\bigr] \tag{1'}$$

If $y\in C^{(4)}[a, b]$, then the error in Simpson's formula on every
doubled interval $[x_{2k-2}, x_{2k}]$ $(k=1, \ldots, m)$ is given, on the
basis of formula (2) of Sec. 16.4, by the formula

$$r_k=-\frac{h^5}{90}y^{(4)}(\xi_k)$$

where $\xi_k\in(x_{2k-2}, x_{2k})$. Summing all these errors, we get the
remainder term for Simpson's general formula in the form

$$R=-\frac{h^5}{90}\sum_{k=1}^{m}y^{(4)}(\xi_k)$$

Since $y^{(4)}(x)$ is continuous on $[a, b]$, there is a point
$\xi\in[a, b]$ such that

$$y^{(4)}(\xi)=\frac1m\sum_{k=1}^{m}y^{(4)}(\xi_k)$$

We therefore have

$$R=-\frac{mh^5}{90}y^{(4)}(\xi)=-\frac{(b-a)h^4}{180}y^{(4)}(\xi) \tag{2}$$

where $\xi\in[a, b]$.
If the maximum admissible error $\varepsilon>0$ is given, then,
denoting

$$M_4=\max_{a\le x\le b}\bigl|y^{(4)}(x)\bigr|,$$

we will have, for determining the spacing $h$, the inequality

$$(b-a)\frac{h^4}{180}M_4<\varepsilon$$

whence

$$h<\sqrt[4]{\frac{180\,\varepsilon}{(b-a)M_4}}$$

Thus, $h$ is of the order of $\sqrt[4]{\varepsilon}$.

In many cases, it is extremely difficult to estimate the error in
Simpson's quadrature formula (1) using (2). Then a double
computation is carried out with spacings $h$ and $2h$, and it is taken
that the coincident decimals belong to the exact value of the
integral.

There is another technique of practical convenience for
computing the error in Simpson's quadrature formula. Assuming that
the derivative $y^{(4)}(x)$ varies slowly on the interval $[a, b]$, we
obtain by (2) an approximate expression for the desired error

$$R=Mh^4$$

where the coefficient $M$ will be considered constant. Let
$\Sigma_h$ and $\Sigma_H$ be approximate values of the integral

$$I=\int_a^b y\,dx$$

obtained by Simpson's rule with spacing $h$ and $H=2h$,
respectively. We have

$$I=\Sigma_h+Mh^4$$

and

$$I=\Sigma_H+M(2h)^4$$

whence

$$R=Mh^4=\frac{\Sigma_h-\Sigma_H}{15}$$

For the approximate value of the integral $I$ it is advisable to
take the corrected value

$$I\approx\Sigma_h+\frac{\Sigma_h-\Sigma_H}{15}$$

Note that if the number of divisions $n$ is a multiple of 4, then
the sum $\Sigma_H$ may be computed by using tabular values, taking
every other one.


Example. Using Simpson's formula, compute the integral

$$I=\int_0^1\frac{dx}{1+x}$$

taking $n=10$.

Solution. We have $2m=10$, whence

$$h=\frac{1-0}{10}=0.1$$

The results of the computations are given in Table 67.

TABLE 67
EVALUATING AN INTEGRAL BY SIMPSON'S FORMULA

  i    x_i    y_{2j-1}         y_{2j}
  0    0.0                     y_0 = 1.00000
  1    0.1    0.90909
  2    0.2                     0.83333
  3    0.3    0.76923
  4    0.4                     0.71429
  5    0.5    0.66667
  6    0.6                     0.62500
  7    0.7    0.58824
  8    0.8                     0.55556
  9    0.9    0.52632
 10    1.0                     0.50000 = y_n
  Σ           3.45955 (σ_1)    2.72818 (σ_2)


By formula (1') we have

$$I=\frac h3(y_0+y_n+4\sigma_1+2\sigma_2)=0.69315 \tag{3}$$

Let us compute the error of the result (3). The total error $R$
is made up of the error $R_1$ of operation and the remainder term
$R_2$. Clearly,

$$R_1=\sum_{i=0}^{n}A_i\,\varepsilon$$

where $A_i$ are the coefficients in Simpson's formula and
$\varepsilon$ is the maximal rounding error for the values of the
integrand. In our case

$$R_1=nh\,\varepsilon=(b-a)\,\varepsilon=1\cdot\frac12\cdot10^{-5}=0.5\cdot10^{-5}$$

We estimate the remainder term by formula (2). Since

$$y=\frac1{1+x}$$

then

$$y^{(4)}=(-1)(-2)(-3)(-4)(1+x)^{-5}=\frac{24}{(1+x)^5}$$

whence

$$\max\bigl|y^{(4)}\bigr|=24\quad\text{for}\quad0\le x\le1$$

and, therefore,

$$|R_2|\le\frac{(0.1)^4}{180}\cdot24=1.3\cdot10^{-5}$$

Thus, the limiting total error is

$$R=0.5\cdot10^{-5}+1.3\cdot10^{-5}=1.8\cdot10^{-5}<0.00002$$

and, hence,

$$I=0.69315\pm0.00002$$
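
The parabolic rule and the double computation with spacings $h$ and $2h$ described in this section can be sketched in a few lines. The function name `simpson` and the choice $n=8$ for the double computation are assumptions made for this illustration; the integrand is the one of the example.

```python
import math

def simpson(f, a, b, n):
    """Simpson's general formula (1); n must be even."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    sigma1 = sum(f(a + i * h) for i in range(1, n, 2))   # odd-numbered ordinates
    sigma2 = sum(f(a + i * h) for i in range(2, n, 2))   # even-numbered interior ordinates
    return h / 3 * (f(a) + f(b) + 4 * sigma1 + 2 * sigma2)

f = lambda x: 1 / (1 + x)
s10 = simpson(f, 0.0, 1.0, 10)        # the example of this section, n = 10

# Double computation with spacings h and H = 2h (n a multiple of 4).
s_h = simpson(f, 0.0, 1.0, 8)
s_H = simpson(f, 0.0, 1.0, 4)
estimate = (s_h - s_H) / 15           # approximate remainder R = M h^4 of s_h
corrected = s_h + estimate

print(s10, s_h, corrected, math.log(2))
```

With unrounded ordinates the value for $n=10$ lies well inside the interval $0.69315\pm0.00002$ obtained above, and the corrected value is closer to $\ln2$ than either of the two Simpson sums.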


16.8 ON CHEBYSHEV’S QUADRATURE FORMULA 


Let us consider the quadrature formula

$$\int_{-1}^{1}f(t)\,dt=\sum_{i=1}^{n}B_if(t_i) \tag{1}$$

where $B_i$ are constant coefficients.
Chebyshev suggested choosing the abscissas $t_i$ so that:
(1) the coefficients $B_i$ are equal,
(2) the quadrature formula (1) is exact for all polynomials of
degree up to $n$ inclusive.
Let us show how the $B_i$ and $t_i$ can then be found. Setting

$$B_1=B_2=\ldots=B_n=B$$

and noting that for $f(t)\equiv1$ we have

$$2=\sum_{i=1}^{n}B_i=nB$$

we obtain

$$B=\frac2n$$

Consequently, Chebyshev's quadrature formula is of the form

$$\int_{-1}^{1}f(t)\,dt=\frac2n\sum_{i=1}^{n}f(t_i) \tag{2}$$
=! 


To determine the abscissas $t_i$, note that formula (2), by
Condition (2), must be exact for functions of the form

$$f(t)=t,\ t^2,\ \ldots,\ t^n$$

Substituting these functions into (2), we obtain the system of
equations

$$\left.\begin{aligned}
t_1+t_2+\ldots+t_n&=0,\\
t_1^2+t_2^2+\ldots+t_n^2&=\frac n3,\\
t_1^3+t_2^3+\ldots+t_n^3&=0,\\
t_1^4+t_2^4+\ldots+t_n^4&=\frac n5,\\
\ldots\ldots\ldots\ldots\ldots\ldots&\\
t_1^n+t_2^n+\ldots+t_n^n&=\frac{n\,[1-(-1)^{n+1}]}{2(n+1)}
\end{aligned}\right\} \tag{3}$$

from which we can determine the unknowns $t_i$ $(i=1, 2, \ldots, n)$.

Chebyshev demonstrated that the solution of system (3) reduces
to finding the roots of a certain algebraic equation of degree $n$
[6], [8]. Table 68 lists the values of the roots $t_i$ of system (3)
for $n=2, 3, \ldots, 7$.


TABLE 68
VALUES OF ABSCISSAS t_i IN CHEBYSHEV'S FORMULA

 n    i        t_i
 2    1; 2     ∓0.577350
 3    1; 3     ∓0.707107
      2         0
 4    1; 4     ∓0.794654
      2; 3     ∓0.187592
 5    1; 5     ∓0.832498
      2; 4     ∓0.374541
      3         0
 6    1; 6     ∓0.866247
      2; 5     ∓0.422519
      3; 4     ∓0.266635
 7    1; 7     ∓0.883862
      2; 6     ∓0.529657
      3; 5     ∓0.323912
      4         0

As S. N. Bernstein demonstrated, the system (3) for $n=8$ and
$n\ge10$ does not have any real solutions. Therein lies the
fundamental defect of Chebyshev's quadrature formula.


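Example 1. Derive Chebyshev's formula with three ordinates ($n=3$).

Solution. We have the following system of equations for
determining the abscissas $t_i$ $(i=1, 2, 3)$:

$$\left.\begin{aligned}
t_1+t_2+t_3&=0,\\
t_1^2+t_2^2+t_3^2&=1,\\
t_1^3+t_2^3+t_3^3&=0
\end{aligned}\right\} \tag{4}$$

Let us consider the symmetric functions of the roots

$$C_1=t_1+t_2+t_3,\qquad
C_2=t_1t_2+t_1t_3+t_2t_3,\qquad
C_3=t_1t_2t_3$$

From system (4) we have

$$C_1=0,$$
$$C_2=\frac12\bigl[(t_1+t_2+t_3)^2-(t_1^2+t_2^2+t_3^2)\bigr]=\frac12(0-1)=-\frac12,$$
$$C_3=\frac16\bigl[(t_1+t_2+t_3)^3-3(t_1+t_2+t_3)(t_1^2+t_2^2+t_3^2)
+2(t_1^3+t_2^3+t_3^3)\bigr]=\frac16(0-0+0)=0$$

From this we conclude that $t_i$ are the roots of the auxiliary
equation

$$t^3-C_1t^2+C_2t-C_3=0$$

or

$$t^3-\frac12t=0$$

Hence, we can take

$$t_1=-\frac1{\sqrt2},\qquad t_2=0,\qquad t_3=\frac1{\sqrt2}$$

Thus, the corresponding Chebyshev formula has the form

$$\int_{-1}^{1}f(t)\,dt=\frac23\left[f\!\left(-\frac1{\sqrt2}\right)+f(0)+f\!\left(\frac1{\sqrt2}\right)\right]$$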
To apply the Chebyshev quadrature formula to an integral of
the form

$$\int_a^b f(x)\,dx$$

it is necessary to transform it by the substitution

$$x=\frac{b+a}{2}+\frac{b-a}{2}\,t$$

which carries the interval $a\le x\le b$ into the interval
$-1\le t\le1$. Applying Chebyshev's formula (2) to the transformed
integral, we get

$$\int_a^b f(x)\,dx=\frac{b-a}{n}\sum_{i=1}^{n}f(x_i) \tag{5}$$

where

$$x_i=\frac{b+a}{2}+\frac{b-a}{2}\,t_i \tag{6}$$

and $t_i$ $(i=1, 2, \ldots, n)$ are the roots of system (3) (given in
Table 68).
Chebyshev's quadrature formula is mostly used in shipbuilding.
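
Formulas (5) and (6) translate directly into code. The sketch below uses the five abscissas for $n=5$ from Table 68; the function name `chebyshev` and both test integrands are assumptions chosen here for illustration.

```python
import math

# Abscissas t_i for n = 5, taken from Table 68.
T5 = [-0.832498, -0.374541, 0.0, 0.374541, 0.832498]

def chebyshev(f, a, b, t=T5):
    """Formula (5): equal weights (b - a)/n at the points x_i of formula (6)."""
    n = len(t)
    xs = [(b + a) / 2 + (b - a) / 2 * ti for ti in t]
    return (b - a) / n * sum(f(x) for x in xs)

quartic = chebyshev(lambda t: t**4, -1.0, 1.0)       # exact value is 2/5
log2 = chebyshev(lambda x: 1 / (1 + x), 0.0, 1.0)    # exact value is ln 2

print(quartic, log2, math.log(2))
```

The first value reproduces $\frac25$ up to the rounding of the tabulated abscissas, which checks the fourth equation of system (3); the second differs from $\ln2$ by about $10^{-5}$.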
Example 2. Evaluate the integral 
1 





D 


using Chebyshev's formula with five ordinates ($n=5$).
Solution. Introducing the notation 





we have 
$$I=\frac15\bigl[f(x_1)+f(x_2)+f(x_3)+f(x_4)+f(x_5)\bigr]$$

where, by formula (6),

$$x_1=\frac12+\frac12t_1=\frac12+\frac12\cdot(-0.83250)=0.08375,$$
$$x_2=\frac12+\frac12t_2=\frac12+\frac12\cdot(-0.37454)=0.31273,$$
$$x_3=\frac12+\frac12\cdot0=0.5,$$
$$x_4=1-x_2=0.68727,$$
$$x_5=1-x_1=0.91625$$

The respective values of $y_i=f(x_i)$ $(i=1, 2, 3, 4, 5)$ of the
integrand are listed in Table 69.
From this we get

$$I=\frac15\cdot1.5342=0.3068$$

By way of comparison, we give the exact value of the integral to
six significant digits:

$$I=0.306846\ldots$$


TABLE 69
EVALUATING AN INTEGRAL BY CHEBYSHEV'S FORMULA

 i    x_i        y_i
 1    0.08375    …
 2    0.31273    …
 3    0.50000    …
 4    0.68727    …
 5    0.91625    …


16.9 GAUSSIAN QUADRATURE FORMULA 


In this section we will need some facts about Legendre
polynomials, which are polynomials of the form

$$P_n(x)=\frac1{2^n\,n!}\,\frac{d^n}{dx^n}\bigl[(x^2-1)^n\bigr]\qquad(n=0,1,2,\ldots)$$

The following are the basic properties of Legendre polynomials [1]:

(1) $P_n(1)=1$, $P_n(-1)=(-1)^n$ $(n=0, 1, \ldots)$;

(2) $\int_{-1}^{1}P_n(x)\,Q_k(x)\,dx=0$ $(k<n)$, where $Q_k(x)$ is any
polynomial of degree $k$ less than $n$;

(3) the Legendre polynomial $P_n(x)$ has $n$ distinct real roots
lying in the interval $(-1, 1)$.

Listed below are the first five Legendre polynomials (their
graphs are shown in Fig. 73):

$$P_0(x)=1,$$
$$P_1(x)=x,$$
$$P_2(x)=\frac12(3x^2-1),$$
$$P_3(x)=\frac12(5x^3-3x),$$
$$P_4(x)=\frac18(35x^4-30x^2+3)$$

Let us now derive the Gaussian quadrature formula. We first
consider a function $y=f(t)$ specified on the standard interval
$[-1, 1]$. The general case can readily be reduced to our case by
a linear substitution of the independent variable.


We pose the problem as follows: how must one choose points
$t_1, t_2, \ldots, t_n$ and coefficients $A_1, A_2, \ldots, A_n$ so that
the quadrature formula

$$\int_{-1}^{1}f(t)\,dt=\sum_{i=1}^{n}A_if(t_i) \tag{1}$$

is exact for all polynomials $f(t)$ of degree $N$ as high as possible?

Fig. 73

Since we have at our disposal $2n$ constants $t_i$ and $A_i$
$(i=1, 2, \ldots, n)$, and a polynomial of degree $2n-1$ is determined
by $2n$ coefficients, this highest possible degree is in the general
case clearly equal to $N=2n-1$.

To ensure equation (1) it is necessary and sufficient that it be 
valid for 


FS], b, P, nee, PD 


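Indeed, setting

$$\int_{-1}^{1} t^k\,dt = \sum_{i=1}^{n} A_i t_i^k \qquad (k = 0, 1, 2, \ldots, 2n-1) \qquad (2)$$

and

$$f(t) = \sum_{k=0}^{2n-1} C_k t^k$$

we get

$$\int_{-1}^{1} f(t)\,dt = \sum_{k=0}^{2n-1} C_k \int_{-1}^{1} t^k\,dt = \sum_{k=0}^{2n-1} C_k \sum_{i=1}^{n} A_i t_i^k = \sum_{i=1}^{n} A_i \sum_{k=0}^{2n-1} C_k t_i^k = \sum_{i=1}^{n} A_i f(t_i)$$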
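Thus, taking into account the relations

$$\int_{-1}^{1} t^k\,dt = \frac{1 - (-1)^{k+1}}{k+1} = \begin{cases} \dfrac{2}{k+1} & \text{for } k \text{ even} \\ 0 & \text{for } k \text{ odd} \end{cases}$$

we conclude that to solve the problem [2], [3], [6], it is sufficient
to determine $t_i$ and $A_i$ from the following system of $2n$ equations:

$$\left.\begin{aligned}
\sum_{i=1}^{n} A_i &= 2, \\
\sum_{i=1}^{n} A_i t_i &= 0, \\
&\ \,\vdots \\
\sum_{i=1}^{n} A_i t_i^{2n-2} &= \frac{2}{2n-1}, \\
\sum_{i=1}^{n} A_i t_i^{2n-1} &= 0
\end{aligned}\right\} \qquad (3)$$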


The system (3) is a nonlinear system and its solution in the
ordinary manner involves great mathematical difficulties. However,
the following artificial device may be employed.

Consider the polynomials

$$f(t) = t^k P_n(t) \qquad (k = 0, 1, \ldots, n-1)$$

where $P_n(t)$ is Legendre’s polynomial.
Since the degrees of these polynomials do not exceed $2n - 1$,
formula (1) should, on the basis of (3), hold true, and

$$\int_{-1}^{1} t^k P_n(t)\,dt = \sum_{i=1}^{n} A_i t_i^k P_n(t_i) \qquad (k = 0, 1, \ldots, n-1) \qquad (4)$$

On the other hand, by virtue of the orthogonality property of
Legendre polynomials (Property 2), the equations

$$\int_{-1}^{1} t^k P_n(t)\,dt = 0 \quad \text{for } k < n$$

are valid and therefore

$$\sum_{i=1}^{n} A_i t_i^k P_n(t_i) = 0 \qquad (k = 0, 1, \ldots, n-1) \qquad (5)$$

Equations (5) will definitely be ensured for any values $A_i$ if we put
$$P_n(t_i) = 0 \qquad (i = 1, 2, \ldots, n) \qquad (6)$$
Thus, to achieve the maximum accuracy of the quadrature formula
(1), it is sufficient to take for the points $t_i$ the zeros of the
respective Legendre polynomial. As is known (Property 3), these
zeros are real and distinct and lie in the interval $(-1, 1)$.
Knowing the abscissas $t_i$, the coefficients $A_i$ $(i = 1, 2, \ldots, n)$
can readily be found from the linear system of the first $n$ equations
of system (3). The determinant of this subsystem is the
Vandermonde determinant
$$D = \prod_{i > j} (t_i - t_j) \neq 0$$

and, hence, the $A_i$ are determined unambiguously. It may be shown
that the quadrature formula (1) with the thus obtained coefficients
$A_i$ $(i = 1, \ldots, n)$ will be exact for all polynomials of
degree not higher than $2n - 1$.

Formula (1), where the $t_i$ are zeros of the Legendre polynomial
$P_n(t)$ and the $A_i$ $(i = 1, 2, \ldots, n)$ are determined from system (3),
is called the Gaussian quadrature formula.
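The construction just described can be sketched in a few lines. The code below is an illustrative sketch, not the book's procedure in every detail: the zeros of $P_n(t)$ are found by Newton's method with a commonly used cosine starting guess (both are assumptions, as is the recurrence used to evaluate $P_n$), while the coefficients $A_i$ are obtained, as in the text, from the first $n$ equations of system (3). All function names are hypothetical.

```python
import math


def legendre_with_derivative(n, t):
    """Return (P_n(t), P_n'(t)) for n >= 1 via the three-term recurrence."""
    p_prev, p = 1.0, t
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * t * p - k * p_prev) / (k + 1)
    # Identity (t^2 - 1) P_n'(t) = n (t P_n(t) - P_{n-1}(t)), valid for |t| != 1.
    return p, n * (t * p - p_prev) / (t * t - 1.0)


def gauss_nodes(n):
    """Zeros of P_n in increasing order (Newton's method, assumed starting guess)."""
    nodes = []
    for i in range(1, n + 1):
        t = -math.cos(math.pi * (i - 0.25) / (n + 0.5))
        for _ in range(100):
            p, dp = legendre_with_derivative(n, t)
            step = p / dp
            t -= step
            if abs(step) < 1e-15:
                break
        nodes.append(t)
    return nodes


def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    a = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (a[r][n] - sum(a[r][c] * x[c] for c in range(r + 1, n))) / a[r][r]
    return x


def gauss_rule(n):
    """Nodes t_i and coefficients A_i from the first n equations of system (3)."""
    t = gauss_nodes(n)
    vandermonde = [[ti**k for ti in t] for k in range(n)]
    moments = [2.0 / (k + 1) if k % 2 == 0 else 0.0 for k in range(n)]
    return t, solve_linear(vandermonde, moments)


t, A = gauss_rule(3)
print(t)  # zeros of P_3
print(A)  # corresponding coefficients
```

Solving the Vandermonde system directly is adequate for small $n$; for large $n$ this system becomes ill-conditioned and other ways of computing the coefficients are usually preferred.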

Example 1. Derive the Gaussian quadrature formula for the case 
of three ordinates (n= 3). 


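Solution. The Legendre polynomial of degree three is
$$P_3(t) = \tfrac{1}{2}\,(5t^3 - 3t)$$
Equating this polynomial to zero, we find the roots
$$t_1 = -\sqrt{\tfrac{3}{5}} \approx -0.774597,$$
$$t_2 = 0,$$
$$t_3 = \sqrt{\tfrac{3}{5}} \approx 0.774597$$

To determine the coefficients $A_1$, $A_2$, $A_3$, we have, by (3),
$$A_1 + A_2 + A_3 = 2,$$
$$-\sqrt{\tfrac{3}{5}}\,A_1 + \sqrt{\tfrac{3}{5}}\,A_3 = 0,$$
$$\tfrac{3}{5}\,A_1 + \tfrac{3}{5}\,A_3 = \tfrac{2}{3}$$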

























whence
$$A_1 = A_3 = \tfrac{5}{9}, \qquad A_2 = \tfrac{8}{9}$$
Therefore,
$$\int_{-1}^{1} f(t)\,dt = \tfrac{1}{9}\left[5 f\!\left(-\sqrt{\tfrac{3}{5}}\right) + 8 f(0) + 5 f\!\left(\sqrt{\tfrac{3}{5}}\right)\right]$$

TABLE 70
ELEMENTS OF THE GAUSSIAN FORMULA

n | i    | $t_i$       | $A_i$
1 | 1    | 0           | 2
2 | 1; 2 | ∓0.57735027 | 1
3 | 1; 3 | ∓0.77459667 | 5/9 = 0.55555556
  | 2    | 0           | 8/9 = 0.88888889
4 | 1; 4 | ∓0.86113631 | 0.34785484
  | 2; 3 | ∓0.33998104 | 0.65214516
5 | 1; 5 | ∓0.90617985 | 0.23692688
  | 2; 4 | ∓0.53846931 | 0.47862868
  | 3    | 0           | 0.56888889
6 | 1; 6 | ∓0.93246951 | 0.17132450
  | 2; 5 | ∓0.66120939 | 0.36076158
  | 3; 4 | ∓0.23861919 | 0.46791394
7 | 1; 7 | ∓0.94910791 | 0.12948496
  | 2; 6 | ∓0.74153119 | 0.27970540
  | 3; 5 | ∓0.40584515 | 0.38183006
  | 4    | 0           | 0.41795918
8 | 1; 8 | ∓0.96028986 | 0.10122854
  | 2; 7 | ∓0.79666648 | 0.22238104
  | 3; 6 | ∓0.52553242 | 0.31370664
  | 4; 5 | ∓0.18343464 | 0.36268378







For reference (see Table 70), we give the approximate values of
the abscissas $t_i$ and coefficients $A_i$ in the Gaussian quadrature
formula (1) for n = 1 to 8 (see [1], [4], [6]).

The inconvenience of the Gaussian quadrature formula consists
in the fact that the abscissas of the points $t_i$ and the coefficients
$A_i$ are, generally speaking, irrational numbers. This defect is
partially compensated for by its high precision for a relatively
small number of ordinates.

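Let us now consider using the Gaussian quadrature formula for
evaluating the integral
$$\int_a^b f(x)\,dx$$
Making a change of variable,
$$x = \frac{b+a}{2} + \frac{b-a}{2}\,t$$
we get
$$\int_a^b f(x)\,dx = \frac{b-a}{2}\int_{-1}^{1} f\!\left(\frac{b+a}{2} + \frac{b-a}{2}\,t\right)dt$$
Applying the Gaussian quadrature formula (1) to this last integral,
we have
$$\int_a^b f(x)\,dx = \frac{b-a}{2}\sum_{i=1}^{n} A_i f(x_i) \qquad (7)$$
where
$$x_i = \frac{b+a}{2} + \frac{b-a}{2}\,t_i \qquad (i = 1, 2, \ldots, n) \qquad (8)$$
and $t_i$ are the zeros of the Legendre polynomial $P_n(t)$; that is,
$$P_n(t_i) = 0$$
The remainder term of the Gaussian formula (7) with n points
is expressed as follows [1], [6]:
$$R_n = \frac{(b-a)^{2n+1}\,(n!)^4}{[(2n)!]^3\,(2n+1)}\,f^{(2n)}(\xi)$$
whence we obtain
$$R_2 = \frac{1}{135}\left(\frac{b-a}{2}\right)^{5} f^{(4)}(\xi),$$
$$R_3 = \frac{1}{15750}\left(\frac{b-a}{2}\right)^{7} f^{(6)}(\xi),$$
$$R_4 = \frac{1}{3472875}\left(\frac{b-a}{2}\right)^{9} f^{(8)}(\xi),$$
$$R_5 = \frac{1}{1237732650}\left(\frac{b-a}{2}\right)^{11} f^{(10)}(\xi),$$
$$R_6 = \frac{1}{648984486150}\left(\frac{b-a}{2}\right)^{13} f^{(12)}(\xi)$$
and so forth.

Example 2. Evaluate the integral
$$I = \int_0^1 \sqrt{1 + 2x}\,dx$$
using the Gaussian formula with three ordinates (n = 3).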


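Solution. We have a = 0 and b = 1. By formula (8) and Table 70,
the abscissas of the points will have the following values to five
significant figures:
$$x_1 = \tfrac{1}{2} + \tfrac{1}{2}\,t_1 = 0.11270,$$
$$x_2 = \tfrac{1}{2} + \tfrac{1}{2}\,t_2 = 0.50000,$$
$$x_3 = \tfrac{1}{2} + \tfrac{1}{2}\,t_3 = 0.88730$$
The corresponding coefficients of formula (7) will, in our case, be
$$C_1 = \tfrac{1}{2}A_1 = \tfrac{1}{2}\cdot\tfrac{5}{9} = \tfrac{5}{18} = 0.27778,$$
$$C_2 = \tfrac{1}{2}A_2 = \tfrac{1}{2}\cdot\tfrac{8}{9} = \tfrac{4}{9} = 0.44444,$$
$$C_3 = \tfrac{1}{2}A_3 = \tfrac{1}{2}\cdot\tfrac{5}{9} = \tfrac{5}{18} = 0.27778$$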


The subsequent computations are given in Table 71. 


TABLE 71
SCHEME FOR EVALUATING AN INTEGRAL BY THE GAUSSIAN FORMULA

i | $x_i$   | $y_i$   | $C_i$   | $C_i y_i$
1 | 0.11270 | 1.10698 | 0.27778 | 0.30750
2 | 0.50000 | 1.41421 | 0.44444 | 0.62853
3 | 0.88730 | 1.66571 | 0.27778 | 0.46270
Σ |         |         |         | 1.39873

Consequently
$$I = \sum_{i=1}^{3} C_i y_i = 1.39873$$


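To estimate the residual error $R_3$, we can take advantage of the
formula
$$R_3 = \frac{1}{15750}\left(\frac{b-a}{2}\right)^{7} f^{(6)}(\xi)$$
Supposing
$$f(x) = \sqrt{1 + 2x} = (1 + 2x)^{1/2}$$
we have
$$f^{(6)}(x) = -\frac{945}{(1 + 2x)^{11/2}}$$
From this,
$$\max |f^{(6)}(x)| = 945 \quad \text{for } 0 \le x \le 1$$
and, hence,
$$|R_3| \le \frac{945}{15750}\left(\frac{1}{2}\right)^{7} \approx \frac{1}{2000}$$
Note that the exact value of the integral is
$$I = \sqrt{3} - \tfrac{1}{3} \approx 1.39872$$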
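Formulas (7) and (8) for n = 3 and the error bound for this example are easy to check numerically. The sketch below is an illustration only; the helper name `gauss3` is hypothetical, and the nodes and coefficients are the exact n = 3 values from Example 1.

```python
import math

# Gaussian elements for n = 3 on [-1, 1] (Example 1 / Table 70).
T3 = (-math.sqrt(3 / 5), 0.0, math.sqrt(3 / 5))
A3 = (5 / 9, 8 / 9, 5 / 9)


def gauss3(f, a, b):
    """Three-point Gaussian formula (7) on [a, b], with abscissas from (8)."""
    half = (b - a) / 2
    mid = (b + a) / 2
    return half * sum(A * f(mid + half * t) for t, A in zip(T3, A3))


def f(x):
    return math.sqrt(1 + 2 * x)


approx = gauss3(f, 0.0, 1.0)
exact = math.sqrt(3) - 1 / 3
bound = 945 / 15750 * 0.5**7  # bound on |R_3| from max |f^(6)| = 945

print(f"approximate value: {approx:.6f}")
print(f"exact value:       {exact:.6f}")
print(f"actual error:      {abs(exact - approx):.2e}")
print(f"bound on |R_3|:    {bound:.2e}")
```

The actual error is well inside the bound, which is to be expected since the bound uses the maximum of the sixth derivative over the whole interval.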
16.10 SOME REMARKS ON THE ACCURACY OF QUADRATURE
FORMULAS


The quadrature formulas we have considered have the following
structure:

$$\int_a^b f(x)\,dx = \sum_{i=1}^{n} A_i f(x_i) + R[f] \qquad (1)$$

where $x_1, x_2, \ldots, x_n$ are a given set of points in the interval of
integration $[a, b]$, $A_i$ are some known constant coefficients, and $R[f]$
is the remainder term.


The accuracy of various quadrature formulas differs for one and 
the same number of ordinates. 


Example. Compare the accuracy of different quadrature formulas
with three ordinates for the integral

$$I = \int_{-1}^{1} \sqrt{2 + x}\,dx = 2\sqrt{3} - \tfrac{2}{3} = 2.797435\ldots$$

Solution. Using Simpson’s formula, we get
$$I \approx \tfrac{1}{3}\left[\sqrt{2 - 1} + 4\sqrt{2 + 0} + \sqrt{2 + 1}\right] = \tfrac{1}{3}\cdot 8.388905 = 2.796302$$
Chebyshev’s formula yields the result
$$I \approx \tfrac{2}{3}\left[\sqrt{2 - \tfrac{\sqrt{2}}{2}} + \sqrt{2 + 0} + \sqrt{2 + \tfrac{\sqrt{2}}{2}}\right] = \tfrac{2}{3}\cdot 4.196597 = 2.797731$$
Finally, the Gaussian formula gives the value
$$I \approx 0.555556\left(\sqrt{2 - 0.774597} + \sqrt{2 + 0.774597}\right) + 0.888889\,\sqrt{2 + 0} = 2.797463$$

Thus, in this case the Gaussian formula is the most exact.
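The comparison can be reproduced with a few lines of code. This is a minimal sketch for this one integrand; the three-ordinate rules are written out explicitly, with the Chebyshev abscissas $\pm\sqrt{2}/2,\ 0$ and the Gaussian elements for n = 3.

```python
import math


def f(x):
    return math.sqrt(2 + x)


exact = 2 * math.sqrt(3) - 2 / 3

# Simpson's formula on [-1, 1] with h = 1.
simpson = (1 / 3) * (f(-1) + 4 * f(0) + f(1))

# Chebyshev's formula with three ordinates (equal coefficients 2/3).
c = math.sqrt(2) / 2
chebyshev = (2 / 3) * (f(-c) + f(0) + f(c))

# Gaussian formula with three ordinates.
g = math.sqrt(3 / 5)
gauss = (5 / 9) * (f(-g) + f(g)) + (8 / 9) * f(0)

for name, value in (("Simpson", simpson), ("Chebyshev", chebyshev), ("Gauss", gauss)):
    print(f"{name:10s} {value:.6f}   error {exact - value:+.6f}")
```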

We confine ourselves to an examination of quadrature formulas
with equally spaced points; these include the most common
formulas: the trapezoidal formula, Simpson’s formula, the
Newton-Cotes formulas. In this case, the accuracy of the quadrature
formula is in the main characterized by the order of the remainder
term

$$R = O(h^m) \qquad (2)$$

where

$$h = \frac{b-a}{n}$$

is the spacing (n is the number of subdivisions) and m is a natural
number. For example, for the trapezoidal formula (Sec. 16.3) we have

$$R[f] = -\frac{b-a}{12}\,h^2 f''(\xi)$$

and so m = 2; for Simpson’s formula (Sec. 16.4) we have

$$R[f] = -\frac{b-a}{180}\,h^4 f^{(4)}(\xi)$$

whence m = 4. The larger the number m, the more exact the
quadrature formula; in this sense, Simpson’s formula is more exact
than the trapezoidal formula. The quality of a formula is revealed
when we have a sufficiently small spacing h.

Fig. 74






From this it follows that in specific cases a crude quadrature
formula can yield better results than a more exact one (for the
same spacing). For example, for the function (Fig. 74)

$$f(x) = -8 + 45x^2 - 25x^4$$
we have
$$I = \int_{-1}^{1} f(x)\,dx = 2\,(-8 + 15 - 5) = 4$$

For h = 1 the trapezoidal formula yields the exact value
$$I_1 = \tfrac{1}{2} f(-1) + f(0) + \tfrac{1}{2} f(1) = 6 - 8 + 6 = 4$$

whereas Simpson’s formula for h = 1 does not even ensure the sign
of the integral:

$$I_2 = \tfrac{1}{3}\left[f(-1) + 4 f(0) + f(1)\right] = \tfrac{1}{3}\,(12 - 32 + 12) = -\tfrac{8}{3}$$
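The two values for this polynomial can be confirmed directly. The following short sketch simply evaluates both formulas with spacing h = 1 on [-1, 1].

```python
def f(x):
    return -8 + 45 * x**2 - 25 * x**4


h = 1.0
trapezoid = h * (0.5 * f(-1) + f(0) + 0.5 * f(1))
simpson = (h / 3) * (f(-1) + 4 * f(0) + f(1))

print("trapezoidal formula:", trapezoid)  # coincides with the exact value 4
print("Simpson's formula:  ", simpson)    # negative, although the integral is positive
```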


For a fixed number of points, the accuracy of a quadrature
formula depends essentially upon the location of the points. Greatly
distorted values may result if the arrangement of points is not
suitable. For instance, consider the function $y = f(x)$ given in
Fig. 75. Choosing equally spaced points $a = x_0, x_1, x_2, x_3, x_4 = b$
and employing the appropriate Cotes formula for five ordinates, we
get for the integral
$$I = \int_a^b f(x)\,dx$$
a negative approximate value, whereas it is obvious that $I > 0$.

Fig. 75

It is not very difficult to construct similar examples for any
quadrature formula with an arbitrary number of ordinates.
Generally, for a considerable number of zeros of the integrand
$f(x)$ or for a large number of its extrema [that is, when there
are many zeros of the derivative $f'(x)$], the accuracy of the
quadrature formulas is greatly reduced because of the unavoidable large
values of the higher derivatives. Therefore, the spacing h should
be chosen so that it is much less than the distances between
adjacent zeros of the function $f(x)$ and its derivative $f'(x)$. For this
purpose, it is suggested to divide the basic interval of integration
$[a, b]$ into subintervals $[\alpha, \beta]$, inside each of which the
functions $f(x)$ and $f'(x)$ preserve sign (if this is possible), and to
evaluate the integral by parts, choosing the spacing for each
subinterval. In more complicated cases, one also has to take into
consideration the behaviour of higher derivatives $f^{(n)}(x)$ $(n \ge 2)$. By
way of general orientation, it is well to construct the graph of the
integrand $y = f(x)$ beforehand. If the function is a highly
oscillating one, it is best to employ special computational techniques.
There are also general techniques which have been elaborated for
increasing the accuracy of quadrature formulas [9].

To find the total limiting error of the quadrature formula (1),
one must also take into account the summation error $R_1$. Suppose
the terms $f(x_i)$ $(i = 1, 2, \ldots, n)$ have been computed with an
absolute error not exceeding $\varepsilon$, and the coefficients $A_i$ of the
quadrature formula are exact positive constants. Then we can put

$$R_1 \le \sum_{i=1}^{n} A_i\,\varepsilon = \varepsilon \sum_{i=1}^{n} A_i \qquad (3)$$

Since the quadrature formula (1) is valid for $f(x) \equiv 1$,

$$\int_a^b dx = b - a = \sum_{i=1}^{n} A_i$$

For this reason, from (3) we have
$$R_1 \le (b - a)\,\varepsilon \qquad (4)$$

Fig. 76

Consequently, the total limiting error of a quadrature formula,
without regard for the final rounding error, is

$$R = (b - a)\,\varepsilon + |R[f]|$$

where $|R[f]|$ is the error of method, which may be determined as
indicated above.

Note that if the integrand $y = f(x)$ is given in tabular form by
the values $y_i = f(x_i)$ $(i = 1, 2, \ldots, n)$, then, strictly speaking, we
cannot estimate the accuracy of the quadrature formula (1). This
is because through a finite set of points $M_i(x_i, y_i)$ it is possible to
pass an infinity of curves $y = f(x)$ (Fig. 76) bounding various areas
on the given interval $[a, b]$; that is, the integral

$$I = \int_a^b f(x)\,dx$$
can a priori have an absolutely arbitrary value (see Fig. 76). In
this case, quadrature formulas may be used only if we have some
idea about the unused intermediate values of the integrand function
and its general properties, thus enabling us to get an idea of
the nature of the graph of the function.


*16.11 RICHARDSON EXTRAPOLATION


If for the quadrature formula (1) of Sec. 16.10 we know the
order of the remainder term $R = R[f]$, then we can use the method
of double computation to determine the magnitude of R. Let

$$R = O(h^m) \qquad (m \ge 1)$$
where
$$h = \frac{b-a}{n}$$
(n is the number of divisions), then we can approximately get
$$R = M h^m \qquad (1)$$
where M is a quantity which, for the given integrand $f(x)$, we
will take to be constant on the interval of integration $[a, b]$. We
choose two different spacings

$$h_1 = \frac{b-a}{n_1} \quad \text{and} \quad h_2 = \frac{b-a}{n_2}$$
where $n_1$ and $n_2$ $(n_2 > n_1)$ are the number of subintervals in the
first and second cases, respectively.
Denote by $I_{n_1}$ and $I_{n_2}$ the corresponding approximate values of
the integral $I$. From (1) we have
$$R_{n_1} = I - I_{n_1} = M\left(\frac{b-a}{n_1}\right)^{m} \qquad (2)$$
and
$$R_{n_2} = I - I_{n_2} = M\left(\frac{b-a}{n_2}\right)^{m} \qquad (2')$$
where $R_{n_1}$ and $R_{n_2}$ are the appropriate remainder terms. Then
$$I_{n_2} - I_{n_1} = M\,(b-a)^m\left(\frac{1}{n_1^m} - \frac{1}{n_2^m}\right)$$
and, consequently,
$$M = \frac{(n_1 n_2)^m}{(b-a)^m}\cdot\frac{I_{n_2} - I_{n_1}}{n_2^m - n_1^m}$$

On the basis of (1) we get an expression for the remainder term:

$$R = \left(\frac{n_1 n_2 h}{b-a}\right)^{m}\frac{I_{n_2} - I_{n_1}}{n_2^m - n_1^m}$$

in particular, for $h = h_2$, that is, when $n = n_2$, we have
$$R_{n_2} = \frac{n_1^m}{n_2^m - n_1^m}\,(I_{n_2} - I_{n_1}) \qquad (3)$$

Using correction (3) and by virtue of (2'), we get the following
improved value for the integral $I$:

$$I_{n_1, n_2} = I_{n_2} + \frac{n_1^m}{n_2^m - n_1^m}\,(I_{n_2} - I_{n_1}) \qquad (4)$$

This technique is called Richardson extrapolation [10].
Introducing the notation

$$\frac{n_2}{n_1} = \alpha \qquad (\alpha > 1)$$

we have
$$I_{n_1, n_2} = I_{n_2} + \beta\,(I_{n_2} - I_{n_1}) \qquad (5)$$
where
$$\beta = \frac{1}{\alpha^m - 1} \qquad (6)$$

The coefficients $\beta$ are tabulated for various values of $\alpha$ and m.
Note that for the trapezoidal formula m = 2 and for Simpson’s
formula, m = 4. A special case of formula (5) was given in
Sec. 16.7.
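Formulas (5) and (6) translate directly into code. The following is a minimal sketch under stated assumptions: the helper names `trapezoid` and `richardson` are hypothetical, and the test integral $\int_0^{\pi} \sin x\,dx = 2$ with $n_1 = 2$, $n_2 = 4$ is chosen only for illustration.

```python
import math


def trapezoid(f, a, b, n):
    """General trapezoidal formula with n equal subintervals."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + i * h) for i in range(1, n)) + 0.5 * f(b))


def richardson(i_n1, i_n2, n1, n2, m):
    """Improved value by formula (5) with beta = 1 / (alpha**m - 1), alpha = n2 / n1."""
    beta = 1.0 / ((n2 / n1) ** m - 1.0)
    return i_n2 + beta * (i_n2 - i_n1)


i2 = trapezoid(math.sin, 0.0, math.pi, 2)
i4 = trapezoid(math.sin, 0.0, math.pi, 4)
i24 = richardson(i2, i4, 2, 4, m=2)  # m = 2 for the trapezoidal formula

print(f"I_2 = {i2:.4f}, I_4 = {i4:.4f}, I_2,4 = {i24:.4f}")
```

For $\alpha = 2$ and m = 2 the coefficient is $\beta = 1/3$, so the extrapolated value here coincides with the value Simpson's formula gives for four subintervals.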

We will show that if $I_{n_1} \ne I_{n_2}$, then $I_{n_1, n_2}$ always lies outside
the interval $[I_{n_1}, I_{n_2}]$.

Indeed, if
$$I_{n_1} < I_{n_2}$$
then from formula (5) we have
$$I_{n_1, n_2} > I_{n_2} = \max\{I_{n_1}, I_{n_2}\}$$
But if
$$I_{n_1} > I_{n_2}$$
then from the same formula (5) we obtain
$$I_{n_1, n_2} = I_{n_2} - \beta\,(I_{n_1} - I_{n_2}) < I_{n_2} = \min\{I_{n_1}, I_{n_2}\}$$
Thus
$$I_{n_1, n_2} \notin [I_{n_1}, I_{n_2}]$$

That is, $I_{n_1, n_2}$ is obtained from $I_{n_1}$ and $I_{n_2}$ by means of an
extrapolation operation, whence the name of the method.

If $I_{n_1} = I_{n_2}$, then obviously

$$I_{n_1, n_2} = I_{n_1} = I_{n_2}$$

It may be shown that for a sufficiently smooth integrand
function $f(x)$ the order of the remainder term for $I_{n_1, n_2}$ is at least
equal to m + 1.


Note. Tables 72a and 72b give examples of Richardson extrapo- 
lation. 


TABLE 72a
EXTRAPOLATION FOR THE CASE OF THE TRAPEZOIDAL FORMULA

No. | Integral                                    | $I_2$    | $I_4$    | $I_{2,4}$ | $I$
1   | $I = \int_0^{\pi} \sin x\,dx$               | 1.571    | 1.896    | 2.004     | 2.000
2   | $I = \int_0^{2} e^{-x^2}\,dx$               | 0.877    | 0.881    | 0.8823    | 0.8821
3   | $I = \int_3^{7} x^2 \ln x\,dx$              | 185.7090 | 179.5385 | 177.4819  | 177.4836
4   | $I = \int_0^{4} \frac{dx}{\sqrt{25 - x^2}}$ | 0.9695   | 0.9389   | 0.9286    | 0.9267

No. | $\varepsilon_2 = I - I_2$ | $\varepsilon_4 = I - I_4$ | $\varepsilon_{2,4} = I - I_{2,4}$
1   | 0.429                     | 0.104                     | -0.004
2   | 0.0051                    | 0.0011                    | -0.0002
3   | -8.2254                   | -2.0549                   | 0.0017
4   | -0.0428                   | -0.0122                   | -0.0019




From these two tables it will be seen that, as a rule, the
extrapolation increases the accuracy of computations for functions
without singularities.

It is also possible to derive more exact extrapolation formulas
using the values $I_{n_1}$, $I_{n_2}$ and $I_{n_3}$ of the desired integral that
correspond to the three distinct spacings

$$h_s = \frac{b-a}{n_s} \qquad (s = 1, 2, 3)$$

and taking into account the first two terms of the expansion of
the remainder term of the quadrature formula [10].



TABLE 72b
EXTRAPOLATION FOR THE CASE OF SIMPSON’S FORMULA

No. | Integral                              | $I_2$   | $I_4$   | $I_{2,4}$ | $I$
1   | $I = \int_0^{\pi} \sin x\,dx$         | 2.094   | 2.004   | 2.010     | 2.000
2   | $I = \int_0^{1} \frac{dx}{1 + x^2}$   | 0.7833  | 0.7853  | 0.7855    | 0.7854
3   | $I = \int_3^{7} x^2 \ln x\,dx$        | 177.454 | 177.481 | 177.483   | 177.4836
4   | $I = \int_0^{4} \ldots\,dx$           | 0.0577  | 0.0541  | 0.0538    | 0.0533

No. | $\varepsilon_2 = I - I_2$ | $\varepsilon_4 = I - I_4$ | $\varepsilon_{2,4} = I - I_{2,4}$
1   | -0.094                    | -0.004                    | -0.010
2   | 0.0021                    | 0.0001                    | -0.0001
3   | 0.0296                    | 0.0026                    | 0.0006
4   | -0.0044                   | -0.0008                   | -0.0005


*16.12 BERNOULLI NUMBERS
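Consider the function
$$f(x) = \frac{x}{e^x - 1} \qquad (1)$$

Taking advantage of the familiar expansion
$$e^x = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \ldots$$
we can write

$$f(x) = \frac{1}{1 + \dfrac{x}{2!} + \dfrac{x^2}{3!} + \ldots + \dfrac{x^n}{(n+1)!} + \ldots} \qquad (2)$$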


From this it is clear that the function $f(x)$ can be expanded in
a power series about x = 0; for the sake of convenience in
subsequent computations, we represent this series as
$$\frac{x}{e^x - 1} = \sum_{n=0}^{\infty} \frac{B_n}{n!}\,x^n \qquad (3)$$
where $B_0 = f(0) = 1$. In order to determine the other coefficients
$B_n$ $(n = 1, 2, \ldots)$ of the expansion, which are called Bernoulli
numbers, we make use of the identity obtained from (2):

$$\sum_{n=0}^{\infty} \frac{B_n}{n!}\,x^n \cdot \sum_{n=0}^{\infty} \frac{x^n}{(n+1)!} = 1$$

Multiplying together the power series and equating to zero the
coefficients of positive powers of the variable x, we obtain an
infinite system of linear equations:

$$\frac{B_n}{n!\,1!} + \frac{B_{n-1}}{(n-1)!\,2!} + \ldots + \frac{B_1}{1!\,n!} + \frac{1}{(n+1)!} = 0 \qquad (n = 1, 2, 3, \ldots)$$

or, multiplying by $(n+1)!$ and noting that
$$\frac{(n+1)!}{k!\,(n+1-k)!} = C_{n+1}^{k} \qquad (k = 0, 1, \ldots, n+1)$$

we get

$$C_{n+1}^{n} B_n + C_{n+1}^{n-1} B_{n-1} + \ldots + C_{n+1}^{1} B_1 + 1 = 0 \qquad (4)$$

If we agree to set
$$B_n = B^n \qquad (5)$$

then formula (4) may be compactly written in the following
symbolic form:

$$(B + 1)^{n+1} - B^{n+1} = 0$$

or, replacing n + 1 by n,
$$(B + 1)^n - B^n = 0 \qquad (6)$$


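Putting n = 2, 3, 4, ... in formula (6), we obtain an infinite system
of equations:

$$\left.\begin{aligned}
2B_1 + 1 &= 0, \\
3B_2 + 3B_1 + 1 &= 0, \\
4B_3 + 6B_2 + 4B_1 + 1 &= 0, \\
5B_4 + 10B_3 + 10B_2 + 5B_1 + 1 &= 0, \\
\ldots\ldots\ldots\ldots\ldots\ldots &
\end{aligned}\right\} \qquad (7)$$

Whence, we successively find

$$B_1 = -\tfrac{1}{2}, \quad B_2 = \tfrac{1}{6}, \quad B_3 = 0, \quad B_4 = -\tfrac{1}{30}, \quad B_5 = 0,$$
$$B_6 = \tfrac{1}{42}, \quad B_7 = 0, \quad B_8 = -\tfrac{1}{30}, \quad B_9 = 0, \quad B_{10} = \tfrac{5}{66},$$
$$B_{11} = 0, \quad B_{12} = -\tfrac{691}{2730}, \quad B_{13} = 0, \quad B_{14} = \tfrac{7}{6}, \quad B_{15} = 0,$$
$$B_{16} = -\tfrac{3617}{510}, \quad B_{17} = 0, \quad B_{18} = \tfrac{43867}{798}, \quad B_{19} = 0,$$

and so on.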


Thus, the Bernoulli numbers may be determined step by step 
from the symbolic formula (6); note that after the binomial ex- 
pansion the powers of the B numbers must be replaced by Ber- 
noulli numbers with the appropriate indices. 
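The step-by-step determination can be carried out in exact rational arithmetic. The sketch below solves relation (4) for $B_n$ in terms of the preceding numbers; the function name `bernoulli_numbers` is hypothetical, and Python's `fractions.Fraction` is used only to keep the results exact.

```python
from fractions import Fraction
from math import comb


def bernoulli_numbers(count):
    """Return [B_0, B_1, ..., B_{count-1}] as exact fractions.

    Relation (4) gives, for n >= 1,
        sum_{k=0}^{n} C(n+1, k) * B_k = 0,
    so B_n = -(1 / (n + 1)) * sum_{k=0}^{n-1} C(n+1, k) * B_k.
    """
    B = [Fraction(1)]
    for n in range(1, count):
        s = sum(comb(n + 1, k) * B[k] for k in range(n))
        B.append(-s / (n + 1))
    return B


for n, b in enumerate(bernoulli_numbers(13)):
    print(f"B_{n} = {b}")
```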

Function (1) is called the generating function of Bernoulli
numbers. Taking advantage of the notation of (5), we can write
expansion (3) symbolically as follows:

$$\frac{x}{e^x - 1} = e^{Bx}$$
It is obvious from the structure of system (7) that all the
Bernoulli numbers are rational. Besides, it turns out that the
Bernoulli numbers of odd index, with the exception of $B_1$, are
equal to zero. We prove this property in the general form.
Noting that

$$B_0 = 1 \quad \text{and} \quad B_1 = -\tfrac{1}{2}$$

we have
$$\frac{x}{e^x - 1} + \frac{x}{2} = 1 + \sum_{n=2}^{\infty} \frac{B_n}{n!}\,x^n \qquad (8)$$
Clearly
$$\frac{x}{e^x - 1} + \frac{x}{2} = \frac{x\,(e^x + 1)}{2\,(e^x - 1)} = \frac{x}{2}\cdot\frac{e^{x/2} + e^{-x/2}}{e^{x/2} - e^{-x/2}} = \frac{x}{2}\coth\frac{x}{2}$$

is an even function, and so its expansion (8) contains only even
powers of the variable x and, hence,

$$B_n = 0 \quad \text{for } n = 3, 5, 7, \ldots$$
Bernoulli numbers find applications in a diversity of problems.
For instance, they are used in the important Euler-Maclaurin
summation formula which we will now derive.


*16.13 EULER-MACLAURIN FORMULA
Let $y = f(x)$ be a function defined in the domain $x \ge x_0$. We
consider the finite-difference operator
$$\Delta f(x) = f(x + h) - f(x)$$
where h is a fixed positive quantity. The inverse operator $\frac{1}{\Delta}$ of the
function $f(x)$ is naturally taken to mean the function $F(x)$ which
satisfies the finite-difference equation

$$\Delta F(x) = f(x) \qquad (1)$$

Thus, from (1) we have
$$F(x) = \frac{1}{\Delta}\,f(x) \qquad (2)$$
If $f(x)$ is regarded on a set of equally spaced points

$$x_0, x_1, x_2, \ldots$$
where $\Delta x_i = x_{i+1} - x_i = h$ $(i = 0, 1, 2, \ldots)$, then the inverse
operator $F(x) = \frac{1}{\Delta} f(x)$ can easily be constructed. Indeed, we form
the finite sum

$$S(x_i) = \sum_{j=0}^{i-1} f(x_j) = f(x_0) + f(x_1) + \ldots + f(x_{i-1})$$
and agree that $S(x_0) = 0$. We clearly get

$$\Delta S(x_i) = S(x_{i+1}) - S(x_i) = f(x_i) \qquad (3)$$
On the other hand, by (1) we have
$$\Delta F(x_i) = f(x_i) \qquad (4)$$

Subtracting (4) from (3), we get
$$\Delta\,[F(x_i) - S(x_i)] = 0$$

for i = 0, 1, 2, ... . Hence, the difference $F(x_i) - S(x_i)$ does not
depend on the subscript i and we can put

$$F(x_i) - S(x_i) = F(x_0) - S(x_0) = F(x_0)$$
whence
$$F(x_i) = F(x_0) + S(x_i)$$

where $F(x_0)$ is an arbitrary constant. Thus

$$\frac{1}{\Delta}\,f(x_i) = F(x_0) + S(x_i) \qquad (5)$$

That is to say, the inverse operator for a finite difference is the
operator of a finite summation.




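Let us now introduce the differentiation operator

$$D f(x) = \frac{d f(x)}{dx}$$

The inverse operator $\frac{1}{D}$ is to be understood in the sense of the
operation of integration:
$$\frac{1}{D}\,f(x) = \int f(x)\,dx$$

Using the Taylor series, we find
$$\Delta f(x) = \sum_{k=1}^{\infty} \frac{h^k}{k!}\,f^{(k)}(x) = \left(\sum_{k=1}^{\infty} \frac{h^k D^k}{k!}\right) f(x) = (e^{hD} - 1)\,f(x)$$

Hence
$$\Delta = e^{hD} - 1$$

Then, for the inverse operator $\frac{1}{\Delta}$, we get the following expression:
$$\frac{1}{\Delta} = \frac{1}{e^{hD} - 1}$$

Multiplying both members of this equation by $hD$, we get

$$hD\,\frac{1}{\Delta} = \frac{hD}{e^{hD} - 1}$$

In the right-hand member we have the generating function of
Bernoulli numbers and so

$$hD\,\frac{1}{\Delta} = \sum_{k=0}^{\infty} \frac{B_k}{k!}\,h^k D^k$$

or, expanded,
$$hD\left[\frac{1}{\Delta}\,f(x)\right] = \sum_{k=0}^{\infty} \frac{B_k}{k!}\,h^k f^{(k)}(x) \qquad (6)$$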


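Integrating (6) from $x = x_0$ to $x = x_n$ and using formula (5), we
obtain

$$\frac{1}{\Delta}\,f(x_n) - \frac{1}{\Delta}\,f(x_0) = \frac{1}{h}\int_{x_0}^{x_n} f(x)\,dx + \sum_{k=1}^{\infty} \frac{B_k h^{k-1}}{k!}\left[f^{(k-1)}(x_n) - f^{(k-1)}(x_0)\right]$$

or

$$F(x_n) - F(x_0) = \sum_{i=0}^{n-1} f(x_i) = \frac{1}{h}\int_{x_0}^{x_n} f(x)\,dx + \sum_{k=1}^{\infty} \frac{B_k h^{k-1}}{k!}\left[f^{(k-1)}(x_n) - f^{(k-1)}(x_0)\right]$$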
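Taking into account that

$$B_1 = -\tfrac{1}{2} \quad \text{and} \quad B_{2k+1} = 0 \quad \text{for } k = 1, 2, \ldots$$

we get the Euler-Maclaurin formula (also called the
Euler-Maclaurin summation formula)

$$\int_{x_0}^{x_n} f(x)\,dx = h\left[\tfrac{1}{2} f(x_0) + f(x_1) + \ldots + f(x_{n-1}) + \tfrac{1}{2} f(x_n)\right] - \sum_{k=1}^{m} \frac{B_{2k}}{(2k)!}\,h^{2k}\left[f^{(2k-1)}(x_n) - f^{(2k-1)}(x_0)\right] + R_{2m} \qquad (7)$$

where $R_{2m}$ is the remainder term. The notation (7) in the form of
an infinite series is not always legitimate since the series may
diverge. Substituting the values of the Bernoulli numbers, we get

$$\int_{x_0}^{x_n} y\,dx = h\left(\tfrac{1}{2} y_0 + y_1 + \ldots + y_{n-1} + \tfrac{1}{2} y_n\right) - \frac{h^2}{12}\,(y'_n - y'_0) + \frac{h^4}{720}\,(y'''_n - y'''_0) - \frac{h^6}{30240}\,(y^{V}_n - y^{V}_0) + \ldots - \frac{B_{2m}}{(2m)!}\,h^{2m}\left[f^{(2m-1)}(x_n) - f^{(2m-1)}(x_0)\right] + R_{2m} \qquad (8)$$
The remainder term in the Euler-Maclaurin formula is of the form [6]

$$R_{2m} = -n h^{2m+3}\,\frac{B_{2m+2}}{(2m+2)!}\,f^{(2m+2)}(\xi)$$
where $\xi \in (x_0, x_n)$.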
The Euler-Maclaurin formula (8) is used for approximate
evaluation of definite integrals and also for approximate summation of
the values of functions for equally spaced values of the argument.
Indeed, from (8) we have

$$\sum_{i=0}^{n} f(x_i) = \frac{1}{h}\int_{x_0}^{x_n} f(x)\,dx + \tfrac{1}{2}\left[f(x_0) + f(x_n)\right] + \sum_{k=1}^{m} \frac{B_{2k}}{(2k)!}\,h^{2k-1}\left[f^{(2k-1)}(x_n) - f^{(2k-1)}(x_0)\right] + n h^{2m+2}\,\frac{B_{2m+2}}{(2m+2)!}\,f^{(2m+2)}(\xi) \qquad (9)$$


Example 1. Using the Euler-Maclaurin formula, calculate the
approximate value of the definite integral
$$I = \int_{0.2}^{1} (\sin x - \ln x + e^x)\,dx$$
Solution. Divide the interval [0.2, 1] into, say, eight subintervals
taking h = 0.1 and setting
$$x_i = 0.2 + i \cdot 0.1 \qquad (i = 0, 1, \ldots, 8)$$
The results of the computations of the values of the function
$f(x) = \sin x - \ln x + e^x$ are listed in Table 73.
TABLE 73
VALUES OF THE FUNCTION $f(x) = \sin x - \ln x + e^x$

x      | 0.2     | 0.3     | 0.4     | 0.5     | 0.6
$f(x)$ | 3.02951 | 2.84936 | 2.79754 | 2.82130 | 2.89759

x      | 0.7     | 0.8     | 0.9     | 1.0
$f(x)$ | 3.01465 | 3.16605 | 3.34830 | 3.55975





whence
$$\tfrac{1}{2} f(x_0) + f(x_1) + \ldots + f(x_7) + \tfrac{1}{2} f(x_8) = 24.1894$$
Confining ourselves to the fifth derivative, we have
$$f'(x) = \cos x - \frac{1}{x} + e^x,$$
$$f'''(x) = -\cos x - \frac{2}{x^3} + e^x,$$
$$f^{V}(x) = \cos x - \frac{24}{x^5} + e^x$$


Hence
$$f'(0.2) = -2.7985, \qquad f'(1) = 2.2586,$$
$$f'''(0.2) = -249.7587, \qquad f'''(1) = 0.1780,$$
$$f^{V}(0.2) = -74997.7985, \qquad f^{V}(1) = -20.7415$$

Substituting the values found into (8), we obtain
$$I = 24.1894 \cdot 0.1 - \frac{0.1^2}{12}\,(2.2586 + 2.7985) + \frac{0.1^4}{720}\,(0.1780 + 249.7587) - \frac{0.1^6}{30240}\,(-20.7415 + 74997.7985)$$
$$= 2.41894 - 0.00421 + 0.00004 = 2.41477$$

Direct integration yields
$$I = \left[-\cos x - x\,(\ln x - 1) + e^x\right]_{0.2}^{1} = 2.4148$$
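The computation of this example can be repeated in code. The sketch below evaluates formula (8) with the derivative terms up to the fifth order for $f(x) = \sin x - \ln x + e^x$ on [0.2, 1] with h = 0.1; the derivatives are entered analytically, and the helper names are hypothetical.

```python
import math


def f(x):
    return math.sin(x) - math.log(x) + math.exp(x)


def d1(x):  # f'(x)
    return math.cos(x) - 1 / x + math.exp(x)


def d3(x):  # f'''(x)
    return -math.cos(x) - 2 / x**3 + math.exp(x)


def d5(x):  # fifth derivative
    return math.cos(x) - 24 / x**5 + math.exp(x)


a, b, n = 0.2, 1.0, 8
h = (b - a) / n
xs = [a + i * h for i in range(n + 1)]

# Trapezoidal part of formula (8).
trap = h * (0.5 * f(xs[0]) + sum(f(x) for x in xs[1:-1]) + 0.5 * f(xs[-1]))

# Corrections with B_2/2! = 1/12, B_4/4! = -1/720, B_6/6! = 1/30240.
approx = (trap
          - h**2 / 12 * (d1(b) - d1(a))
          + h**4 / 720 * (d3(b) - d3(a))
          - h**6 / 30240 * (d5(b) - d5(a)))


def antiderivative(x):
    return -math.cos(x) - x * (math.log(x) - 1) + math.exp(x)


exact = antiderivative(b) - antiderivative(a)
print(f"trapezoidal value:      {trap:.6f}")
print(f"Euler-Maclaurin value:  {approx:.6f}")
print(f"direct integration:     {exact:.6f}")
```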
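Example 2. Find the sum of

$$\frac{1}{51^2} + \frac{1}{53^2} + \frac{1}{55^2} + \ldots + \frac{1}{99^2}$$

Solution. In our case

$$f(x) = \frac{1}{x^2}, \quad h = 2, \quad x_0 = 51, \quad x_n = 99$$

We find derivatives of odd order of the function $f(x)$:

$$f'(x) = -\frac{2}{x^3},$$
$$f'''(x) = -\frac{24}{x^5},$$
$$f^{V}(x) = -\frac{720}{x^7},$$
$$f^{VII}(x) = -\frac{40320}{x^9}, \text{ etc.}$$

Substituting into formula (9) and restricting ourselves to the seventh
derivative, we obtain

$$\sum_{x=51}^{99} \frac{1}{x^2} = \frac{1}{2}\left(\frac{1}{51} - \frac{1}{99}\right) + \frac{1}{2}\left(\frac{1}{51^2} + \frac{1}{99^2}\right) + \frac{1}{3}\left(\frac{1}{51^3} - \frac{1}{99^3}\right) - \frac{4}{15}\left(\frac{1}{51^5} - \frac{1}{99^5}\right) + \frac{16}{21}\left(\frac{1}{51^7} - \frac{1}{99^7}\right) - \frac{64}{15}\left(\frac{1}{51^9} - \frac{1}{99^9}\right)$$
$$= 0.004753416 + 0.000243249 + 0.000002169 - 0.000000001 = 0.004998833$$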




By (9), where we have to put h = 2, n = 24, m = 4, the error
in the result is

$$R = 24 \cdot 2^{10} \cdot \frac{B_{10}}{10!}\,f^{(10)}(\xi) < 24 \cdot 2^{10} \cdot \frac{5}{66} \cdot \frac{11}{51^{12}} < 10^{-16}$$
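The summation can be checked against a direct evaluation of the 25 terms. The sketch below is an illustration: it writes out the terms of formula (9) for $f(x) = 1/x^2$, h = 2, with the coefficients $B_{2k}h^{2k-1}/(2k)!$ already multiplied by the derivative factors, and compares the result with the plain sum.

```python
x0, xn, h = 51, 99, 2

# Direct summation of 1/51^2 + 1/53^2 + ... + 1/99^2.
direct = sum(1 / x**2 for x in range(x0, xn + 1, h))

# Terms of formula (9) for f(x) = 1/x^2 up to the seventh derivative.
euler_maclaurin = (
    0.5 * (1 / x0 - 1 / xn)                   # (1/h) * integral of dx / x^2
    + 0.5 * (1 / x0**2 + 1 / xn**2)           # half-sum of the end values
    + (1 / 3) * (1 / x0**3 - 1 / xn**3)       # k = 1, B_2 = 1/6
    - (4 / 15) * (1 / x0**5 - 1 / xn**5)      # k = 2, B_4 = -1/30
    + (16 / 21) * (1 / x0**7 - 1 / xn**7)     # k = 3, B_6 = 1/42
    - (64 / 15) * (1 / x0**9 - 1 / xn**9)     # k = 4, B_8 = -1/30
)

print(f"direct sum:       {direct:.12f}")
print(f"Euler-Maclaurin:  {euler_maclaurin:.12f}")
print(f"difference:       {abs(direct - euler_maclaurin):.1e}")
```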


16.14 APPROXIMATION OF IMPROPER INTEGRALS 
An integral

$$\int_a^b f(x)\,dx \qquad (1)$$

is called proper if (1) the interval of integration $[a, b]$ is finite,
(2) the integrand $f(x)$ is continuous on $[a, b]$; otherwise the
integral (1) is termed improper.

Let us first consider the approximate computation of the
improper integral

$$\int_a^{\infty} f(x)\,dx \qquad (2)$$

with an infinite interval of integration where the function $f(x)$ is
continuous over $a \le x < \infty$.
The integral (2) is convergent if there is a finite limit

$$\lim_{b \to \infty} \int_a^b f(x)\,dx \qquad (3)$$

and, by definition, we assume

$$\int_a^{\infty} f(x)\,dx = \lim_{b \to \infty} \int_a^b f(x)\,dx \qquad (4)$$

If the limit (3) does not exist, then the integral (2) is
divergent; such an integral is considered to be meaningless. Therefore,
before attempting to evaluate an improper integral one must make
sure, using familiar convergence tests [10], that the integral
converges.

To evaluate the convergent improper integral (2) to a given
accuracy $\varepsilon$, we represent it in the form

$$\int_a^{\infty} f(x)\,dx = \int_a^b f(x)\,dx + \int_b^{\infty} f(x)\,dx \qquad (5)$$

Since the integral converges, the number b may be chosen so
large that the inequality

$$\left|\int_b^{\infty} f(x)\,dx\right| < \frac{\varepsilon}{2} \qquad (6)$$

is valid.
The proper integral

$$\int_a^b f(x)\,dx$$
may be computed from one of the quadrature formulas. Let S be an
approximate value of this integral to an accuracy of $\frac{\varepsilon}{2}$; thus

$$\left|\int_a^b f(x)\,dx - S\right| < \frac{\varepsilon}{2} \qquad (7)$$

From formulas (5), (6) and (7) we have

$$\left|\int_a^{\infty} f(x)\,dx - S\right| < \varepsilon$$
and so the problem can be solved.
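The procedure can be sketched for a concrete integrand. The example below is an assumption chosen for illustration, $\int_0^{\infty} e^{-x^2}dx = \sqrt{\pi}/2$: the tail is bounded by $\int_b^{\infty} e^{-x^2}dx \le e^{-b^2}/(2b)$, and the proper integral is computed by Simpson's formula with successive doubling of n, using the practical double-computation estimate $|I_{2n} - I_n|/15$ as a stopping rule (an estimate, not a rigorous bound).

```python
import math


def simpson(f, a, b, n):
    """General Simpson formula with an even number n of subintervals."""
    h = (b - a) / n
    odd = sum(f(a + i * h) for i in range(1, n, 2))
    even = sum(f(a + i * h) for i in range(2, n, 2))
    return h / 3 * (f(a) + 4 * odd + 2 * even + f(b))


def integrate_gaussian_tail_split(eps):
    """Approximate the integral of exp(-x^2) over [0, infinity) to accuracy eps."""
    f = lambda x: math.exp(-x * x)

    # Inequality (6): choose b so that the tail bound is below eps / 2.
    b = 1.0
    while math.exp(-b * b) / (2 * b) >= eps / 2:
        b += 0.5

    # Inequality (7): refine Simpson's formula until the estimated
    # error of the proper integral is below eps / 2.
    n = 4
    prev, cur = simpson(f, 0.0, b, n), simpson(f, 0.0, b, 2 * n)
    while abs(cur - prev) / 15 >= eps / 2:
        n *= 2
        prev, cur = cur, simpson(f, 0.0, b, 2 * n)
    return cur, b


value, b = integrate_gaussian_tail_split(1e-6)
print(f"b = {b}, S = {value:.8f}, exact = {math.sqrt(math.pi) / 2:.8f}")
```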
Suppose now that the interval of integration $[a, b]$ is finite
and the integrand $f(x)$ has a finite number of discontinuities
on $[a, b]$. Since under our assumptions the interval of
integration may be partitioned into subintervals with a single
discontinuity point of the integrand function, it suffices to
examine one case: when there is a single discontinuity point c of the
function $f(x)$ on $[a, b]$, this point being of the second kind.¹⁾
If c is an interior point of $[a, b]$, then by definition we put
$$\int_a^b f(x)\,dx = \lim_{\substack{\delta_1 \to +0 \\ \delta_2 \to +0}} \left\{\int_a^{c - \delta_1} f(x)\,dx + \int_{c + \delta_2}^{b} f(x)\,dx\right\} \qquad (8)$$
and if the limit exists the integral is convergent, otherwise it is
divergent.

¹⁾ If c is a point of discontinuity of the first kind, that is, there exist finite
one-sided limits
$$f(c - 0) = \lim_{x \to c,\ x < c} f(x) \quad \text{and} \quad f(c + 0) = \lim_{x \to c,\ x > c} f(x),$$
then we can put
$$\int_a^b f(x)\,dx = \int_a^c f_1(x)\,dx + \int_c^b f_2(x)\,dx$$
where
$$f_1(x) = \begin{cases} f(x) & \text{if } a \le x < c \\ f(c - 0) & \text{if } x = c \end{cases} \quad \text{and} \quad f_2(x) = \begin{cases} f(c + 0) & \text{if } x = c \\ f(x) & \text{if } c < x \le b \end{cases}$$
The functions $f_1(x)$ and $f_2(x)$ are continuous on the intervals $[a, c]$ and $[c, b]$
respectively. Thus, our integral reduces to the sum of two proper integrals.

In similar fashion we define the convergence of the improper
integral (8) if the discontinuity point c of the integrand $f(x)$
coincides with one of the endpoints of the interval of integration
$[a, b]$.
In order to approximate, to a given accuracy $\varepsilon$, the convergent
improper integral (8), where the point of discontinuity $c \in (a, b)$,
one chooses positive numbers $\delta_1$ and $\delta_2$ so small that the
inequality

$$\left|\int_{c - \delta_1}^{c + \delta_2} f(x)\,dx\right| < \frac{\varepsilon}{2}$$

holds true. Then, using familiar quadrature formulas, one
approximately calculates the proper integrals

$$\int_a^{c - \delta_1} f(x)\,dx \quad \text{and} \quad \int_{c + \delta_2}^{b} f(x)\,dx \qquad (9)$$

It is clear that if $S_1$ and $S_2$ are approximate values of the
integrals (9) to within $\frac{\varepsilon}{4}$, then

$$\int_a^b f(x)\,dx \approx S_1 + S_2$$

to an accuracy of $\varepsilon$. If the discontinuity point c of the integrand
$f(x)$ is an endpoint of the interval of integration $[a, b]$, then the
computational procedure is modified accordingly.


16.15 THE METHOD OF KANTOROVICH 
FOR ISOLATING SINGULARITIES 


A useful device in calculating the approximate value of an
integral of a discontinuous function is the method of L. V. Kantorovich
for isolating singularities [1], [6], [10]. The underlying principle
of this method is that we take out of the integrand $f(x)$ a certain
function $g(x)$ having the same singularities as $f(x)$, which is
integrable in elementary terms on the given interval $[a, b]$ and is
such that the difference $f(x) - g(x)$ is sufficiently smooth on
$[a, b]$. For example,

$$f(x) - g(x) \in C^{(m)}[a, b], \quad \text{where } m \ge 1$$


We then have

$$\int_a^b f(x)\,dx = \int_a^b g(x)\,dx + \int_a^b [f(x) - g(x)]\,dx$$

where the first integral is taken directly, and the second integral
is readily evaluated by means of standard formulas.

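Let us consider the use of this method for computing integrals
of the form

$$\int_a^b \frac{\varphi(x)}{(x - x_0)^{\alpha}}\,dx \qquad (1)$$
where $x_0 \in [a, b]$, $0 < \alpha < 1$ and $\varphi(x)$ is continuous on $[a, b]$.
Let $\varphi(x) \in C^{(m+1)}[a, b]$, that is, let $\varphi(x)$ have continuous
derivatives on $[a, b]$ up to the order (m + 1) inclusive.
Using Taylor’s formula, we have

$$\varphi(x) = \sum_{k=0}^{m} \frac{\varphi^{(k)}(x_0)}{k!}\,(x - x_0)^k + \psi(x) \qquad (2)$$

where

$$\psi(x) = \varphi(x) - \sum_{k=0}^{m} \frac{\varphi^{(k)}(x_0)}{k!}\,(x - x_0)^k = \frac{\varphi^{(m+1)}(\xi)}{(m+1)!}\,(x - x_0)^{m+1} \qquad (3)$$
$[\xi \in (a, b)]$.
From this, we get, for integral (1),

$$\int_a^b \frac{\varphi(x)\,dx}{(x - x_0)^{\alpha}} = \sum_{k=0}^{m} \frac{\varphi^{(k)}(x_0)}{k!} \int_a^b (x - x_0)^{k - \alpha}\,dx + I_1 \qquad (4)$$

where the integrals under the summation sign are taken in
elementary terms and

$$I_1 = \int_a^b \frac{\psi(x)\,dx}{(x - x_0)^{\alpha}} \qquad (5)$$

From formula (3) it follows that
$$\frac{\psi(x)}{(x - x_0)^{\alpha}} \in C^{(m)}[a, b]$$

(at least!); hence, integral (5) is a proper integral and can be
computed to any degree of accuracy by using the appropriate
quadrature formula.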


The Kantorovich method is also applicable to improper
integrals whose integrands have several points of discontinuity of the
type considered. In that case, to evaluate the integral it suffices
to partition the interval of integration into parts containing only
one singular point of the integrand function, and then take
advantage of the additive property of integrals.


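Example 1. Calculate the approximate value of the improper
integral [11]

$$I = \int_0^{1/2} \frac{dx}{\sqrt{x\,(1 - x)}}$$

Solution. The integrand
$$f(x) = x^{-1/2}\,(1 - x)^{-1/2}$$
has one singularity x = 0 on the interval $\left[0, \tfrac{1}{2}\right]$.

Expand the function

$$\varphi(x) = (1 - x)^{-1/2}$$

in a Taylor series in powers of x up to $x^4$. Using the binomial
theorem, we have

$$\varphi(x) = 1 + \tfrac{1}{2}x + \tfrac{3}{8}x^2 + \tfrac{5}{16}x^3 + \tfrac{35}{128}x^4 + \psi(x)$$

whence

$$I = \int_0^{1/2} x^{-1/2}\,dx + \tfrac{1}{2}\int_0^{1/2} x^{1/2}\,dx + \tfrac{3}{8}\int_0^{1/2} x^{3/2}\,dx + \tfrac{5}{16}\int_0^{1/2} x^{5/2}\,dx + \tfrac{35}{128}\int_0^{1/2} x^{7/2}\,dx + I_1 = \frac{715801}{645120}\,\sqrt{2} + I_1 \qquad (6)$$

where
$$I_1 = \int_0^{1/2} \frac{\psi(x)}{\sqrt{x}}\,dx \qquad (7)$$
and
$$\psi(x) = (1 - x)^{-1/2} - \left(1 + \tfrac{1}{2}x + \tfrac{3}{8}x^2 + \tfrac{5}{16}x^3 + \tfrac{35}{128}x^4\right)$$

We compute the proper integral (7) by Simpson’s formula taking
n = 10 and spacing $h = \tfrac{1}{20} = 0.05$. The results of the computations
to six decimal places are listed in Table 74.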


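TABLE 74
EVALUATING INTEGRAL (7) BY SIMPSON’S FORMULA

i  | $x_i$ | $y_0,\ y_{10}$ | $y_{2k-1}$ | $y_{2k}$
0  | 0.00  | 0.000000       |            |
1  | 0.05  |                | 0.000000   |
2  | 0.10  |                |            | 0.000009
3  | 0.15  |                | 0.000056   |
4  | 0.20  |                |            | 0.000216
5  | 0.25  |                | 0.000624   |
6  | 0.30  |                |            | 0.001508
7  | 0.35  |                | 0.003225   |
8  | 0.40  |                |            | 0.006316
9  | 0.45  |                | 0.011588   |
10 | 0.50  | 0.020239       |            |
Σ  |       | 0.020239       | 0.015493   | 0.008049

From this we have

$$I_1 = \frac{0.05}{3}\,(0.020239 + 4 \cdot 0.015493 + 2 \cdot 0.008049) = \frac{1}{60} \cdot 0.098309 = 0.0016385$$

Hence, by (6), we have
$$I = 1.5691585 + 0.0016385 = 1.5707970$$

Note that the integral $I$ is expressible in elementary terms and
its exact value is

$$I = \frac{\pi}{2} = 1.5707963\ldots$$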
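The whole example can be reproduced in a few lines. The sketch below follows the scheme of the example (singular part integrated in closed form, remainder by Simpson's formula with n = 10); the names `psi` and `y` are hypothetical, and the value $y(0) = 0$ is set explicitly because $\psi(x)/\sqrt{x} \to 0$ as $x \to 0$.

```python
import math


def psi(x):
    """Remainder of (1 - x)^(-1/2) after the Taylor terms up to x^4."""
    taylor = 1 + x / 2 + 3 * x**2 / 8 + 5 * x**3 / 16 + 35 * x**4 / 128
    return (1 - x) ** -0.5 - taylor


def y(x):
    """Integrand of the proper integral (7); its limit at x = 0 is 0."""
    return 0.0 if x == 0 else psi(x) / math.sqrt(x)


n, a, b = 10, 0.0, 0.5
h = (b - a) / n
ys = [y(a + i * h) for i in range(n + 1)]

# Simpson's formula: end values, odd-numbered and even-numbered ordinates.
i1 = h / 3 * (ys[0] + ys[-1] + 4 * sum(ys[1:-1:2]) + 2 * sum(ys[2:-1:2]))

closed_part = 715801 / 645120 * math.sqrt(2)  # singular part, integrated exactly
approx = closed_part + i1

print(f"I_1        = {i1:.7f}")
print(f"I (approx) = {approx:.7f}")
print(f"pi / 2     = {math.pi / 2:.7f}")
```

Only eleven ordinates are needed here because the singular factor has been removed analytically; applying Simpson's formula to the original integrand directly is not possible at x = 0.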


Note. In some cases, an improper integral can be transformed
into a proper integral by a change of variable or by integration
by parts.

Example 2. Transform

$$I = \int_1^{\infty} \frac{dx}{(1 + x)\sqrt{x}} \qquad (8)$$

into a proper integral.

Solution. Putting $x = \frac{1}{t}$ in (8), we obtain an integral with finite
limits:
$$I = \int_0^1 \frac{dt}{(1 + t)\sqrt{t}} \qquad (9)$$

but with a singularity at t = 0.
Integrating by parts, we have

$$I = \left[\frac{2\sqrt{t}}{1 + t}\right]_0^1 + 2\int_0^1 \frac{\sqrt{t}\,dt}{(1 + t)^2} = 1 + 2\int_0^1 \frac{\sqrt{t}\,dt}{(1 + t)^2}$$

The remaining integral is a proper integral, and quadrature
formulas can be applied to it without any difficulties. Other
techniques are also used for the transformation of improper integrals [6].


16.16 GRAPHICAL INTEGRATION 


The problem of graphical integration consists in the following:
from the given graph of a continuous function $y = f(x)$ it is
required to construct the graph of its antiderivative:

$$F(x) = \int_a^x f(x)\,dx$$

In other words, we have to construct a curve $y = F(x)$ whose
ordinate at each point x is numerically equal to the area of a
curvilinear trapezoid with base $[a, x]$ bounded by the given curve
$y = f(x)$.

For an approximate construction of the graph of the
antiderivative $y = F(x)$, we partition the area of the corresponding
curvilinear trapezoid bounded by the curve $y = f(x)$ into narrow
vertical strips using the ordinates drawn at the points
$x_0, x_1, x_2, \ldots$ $(x_0 < x_1 < x_2 < \ldots)$ (Fig. 77). Using the mean-value theorem,
we replace each of these strips with an equal (as far as possible)
rectangle having the same base and with altitude equal to $f(\xi_i)$
where $\xi_i$ $(i = 1, 2, \ldots)$ is an intermediate point of the ith interval
$[x_{i-1}, x_i]$; thus, we put

$$\int_{x_{i-1}}^{x_i} f(x)\,dx = f(\xi_i)\,(x_i - x_{i-1})$$

where

$$x_{i-1} \le \xi_i \le x_i$$



The values of the antiderivative

$$F(x) = \int_{x_0}^{x} f(x)\,dx$$

may be computed at the points $x_i$ by the method of accumulation:

$$F(x_0) = 0,$$

$$F(x_i) = \int_{x_0}^{x_i} f(x)\,dx = \int_{x_0}^{x_{i-1}} f(x)\,dx + \int_{x_{i-1}}^{x_i} f(x)\,dx = F(x_{i-1}) + f(\xi_i)\,(x_i - x_{i-1}) \quad (i = 1, 2, \ldots) \qquad (1)$$

Let $M_1(\xi_1, f(\xi_1))$, $M_2(\xi_2, f(\xi_2))$, ... be the corresponding points
of the curve $y = f(x)$. Projecting them on the $y$-axis, we obtain
the points $M'_1, M'_2, \ldots$ (Fig. 77).
Now choose pole $P$ with the distance $OP = 1$ and draw rays
$PM'_1, PM'_2, \ldots$. The desired curve $y = F(x)$ may be approximately
replaced by the polygonal line $N_0N_1N_2N_3\ldots$ with vertices
$N_0(x_0, 0)$, $N_1(x_1, F(x_1))$, $N_2(x_2, F(x_2))$, .... The successive
segments of this polygonal line will be parallel to the corresponding
rays; namely, $N_0N_1 \parallel PM'_1$, $N_1N_2 \parallel PM'_2$, $N_2N_3 \parallel PM'_3$, .... Indeed,
on the basis of formula (1), the slope of the segment $N_{i-1}N_i$ is

$$k_i = \frac{F(x_i) - F(x_{i-1})}{x_i - x_{i-1}} = f(\xi_i)$$

Now, by virtue of the construction, the slope of ray $PM'_i$ is

$$k'_i = \frac{f(\xi_i)}{1} = f(\xi_i)$$

Hence

$$N_{i-1}N_i \parallel PM'_i \quad (i = 1, 2, \ldots)$$

Thus, the actual construction of the graph of the function $y = F(x)$
may be carried out as follows: from point $N_0(x_0, 0)$ draw the
straight line $N_0N_1$ parallel to the ray $PM'_1$ to intersection with
the vertical line $x = x_1$ at the point $N_1$; from $N_1$ draw the straight
line $N_1N_2$ parallel to the ray $PM'_2$ to intersection, at $N_2$, with the
vertical $x = x_2$, and so on.

Note that when applying this method of graphical integration
the points $x_i$ ($i = 0, 1, \ldots$) need not be taken equally spaced. To
increase the accuracy of the construction, it is advisable to include
in the set of points $x_i$ the characteristic points of the graph of
the function to be integrated (zeros, extrema, points of inflection).

Generally speaking, graphical integration has a low degree of 
accuracy and so is useful when the aim is to obtain a rough pic- 
ture of the integral of the function or when the integrand is spe- 
cified graphically and its analytical expression is not known. 
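
The accumulation rule (1) that underlies the graphical construction can also be carried out numerically. The following is a minimal sketch, not part of the original text: the function name `accumulate_antiderivative` is hypothetical, and the intermediate point $\xi_i$ is assumed to be the midpoint of each interval, which is only one possible choice.

```python
def accumulate_antiderivative(f, xs):
    """Return approximate values F(x_i) at the nodes xs, with F(xs[0]) = 0.

    Each step adds the area of a rectangle of height f(xi_i), where xi_i
    is taken (by assumption) to be the midpoint of [x_{i-1}, x_i].
    """
    values = [0.0]
    for left, right in zip(xs, xs[1:]):
        xi = 0.5 * (left + right)
        values.append(values[-1] + f(xi) * (right - left))
    return values


# The nodes need not be equally spaced.
xs = [0.0, 0.2, 0.5, 0.7, 1.0]
F = accumulate_antiderivative(lambda x: 2.0 * x, xs)
```

For the linear integrand used here the midpoint choice happens to be exact, so the computed values coincide with $x_i^2$ up to rounding; for a general integrand the values are only approximate, just as in the graphical method.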


*16.17 ON CUBATURE FORMULAS 

Cubature formulas are designed for numerical evaluation of double 
integrals [1].

Suppose a function $z = f(x, y)$ is defined and continuous in some
bounded domain $\sigma$ (Fig. 78). In this domain $\sigma$ we choose a set of
points (lattice points) $M_i(x_i, y_i)$ ($i = 1, 2, \ldots, N$). To compute
the double integral $\iint_{(\sigma)} f(x, y)\,dx\,dy$ we approximately put

$$\iint_{(\sigma)} f(x, y)\,dx\,dy = \sum_{i=1}^{N} A_i f(x_i, y_i) \qquad (1)$$

In order to find the coefficients $A_i$ we will require that the
cubature formula (1) hold true for all polynomials

$$P_n(x, y) = \sum_{k+l \le n} c_{kl}\, x^k y^l \qquad (2)$$

whose degree does not exceed a specified number $n$. For this it is
necessary and sufficient that formula (1) be exact for the products
of powers

$$x^k y^l \quad (k, l = 0, 1, 2, \ldots, n;\ k + l \le n)$$

Putting $f(x, y) = x^k y^l$ in (1), we have

$$I_{kl} = \iint_{(\sigma)} x^k y^l\,dx\,dy = \sum_{i=1}^{N} A_i x_i^k y_i^l \quad (k, l = 0, 1, 2, \ldots, n;\ k + l \le n) \qquad (3)$$

Thus, generally speaking, the coefficients $A_i$ of (1) can be
determined from the system of linear equations (3).





Fig. 78 Fig. 79 


For the system (3) to be determinate, it is necessary that the 
number of unknowns N be equal to the number of equations, whence, 
forming a “lattice of exponents” (Fig. 79), we obtain 


$$N = (n+1) + n + \ldots + 1 = \frac{(n+1)(n+2)}{2}$$


A difficult and still open question is that of the most approp- 
riate choice of lattice points for a given domain. 
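
As an illustration of how system (3) determines the coefficients, the sketch below treats the simplest case $n = 1$, so that $N = 3$. Everything specific in it is an assumption made for the example: the domain is taken to be the unit square, the three lattice points are chosen arbitrarily, and the helper `solve` is a hypothetical small Gauss-Jordan routine using exact fractions.

```python
from fractions import Fraction


def solve(a, b):
    """Solve the square linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[-1] for row in m]


n = 1
# Exponent pairs (k, l) with k + l <= n; there are (n + 1)(n + 2) / 2 of them.
exponents = [(k, l) for k in range(n + 1) for l in range(n + 1 - k)]
# Assumed lattice points in the unit square 0 <= x <= 1, 0 <= y <= 1.
points = [(Fraction(0), Fraction(0)),
          (Fraction(1), Fraction(0)),
          (Fraction(0), Fraction(1))]
# Moments of the unit square: the integral of x^k y^l is 1 / ((k + 1)(l + 1)).
moments = [Fraction(1, (k + 1) * (l + 1)) for k, l in exponents]
matrix = [[x ** k * y ** l for x, y in points] for k, l in exponents]
A = solve(matrix, moments)

# The resulting formula is exact for any polynomial of degree <= 1.
f = lambda x, y: 2 + 3 * x - y
approx = sum(a * f(x, y) for a, (x, y) in zip(A, points))
```

With these three points the system is nonsingular; a different choice of lattice points may give a singular or ill-conditioned system, which is one aspect of the open question mentioned above.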

Another sufficiently general technique for computing a double
integral may be indicated. Suppose the domain of integration is
bounded by continuous single-valued curves

$$y = \varphi(x), \quad y = \psi(x) \quad (\varphi(x) \le \psi(x))$$

and two vertical lines $x = a$, $x = b$ (Fig. 80).

Using familiar rules, set up the limits of integration in the double
integral

$$I = \iint_{(\sigma)} f(x, y)\,dx\,dy \qquad (4)$$

to get

$$\iint_{(\sigma)} f(x, y)\,dx\,dy = \int_a^b dx \int_{\varphi(x)}^{\psi(x)} f(x, y)\,dy$$

Let

$$F(x) = \int_{\varphi(x)}^{\psi(x)} f(x, y)\,dy \qquad (5)$$

Then

$$\iint_{(\sigma)} f(x, y)\,dx\,dy = \int_a^b F(x)\,dx \qquad (6)$$

Applying one of the quadrature formulas to the single integral in
the right-hand member of (6), we obtain

$$\iint_{(\sigma)} f(x, y)\,dx\,dy = \sum_{i=1}^{n} C_i F(x_i) \qquad (7)$$

where $x_i \in [a, b]$ ($i = 1, 2, \ldots, n$) and $C_i$ are certain constant
coefficients. In turn, the values

$$F(x_i) = \int_{\varphi(x_i)}^{\psi(x_i)} f(x_i, y)\,dy$$

may be found from certain quadrature formulas

$$F(x_i) = \sum_{j=1}^{m_i} B_{ij} f(x_i, y_j)$$

where $B_{ij}$ are appropriate constants.
From formula (7) we derive

$$\iint_{(\sigma)} f(x, y)\,dx\,dy = \sum_{i=1}^{n} \sum_{j=1}^{m_i} C_i B_{ij} f(x_i, y_j) \qquad (8)$$
From formula (7) we derive 


f \ F(x, y) dx dy= 3 SCB (Xi yi) (8) 


t0) tel j=l; 


where $C_i$ and $B_{ij}$ are known constants.
Geometrically, this method is equivalent to the computation of
a volume $I$ expressed by the integral (4) by means of cross-sections.
The general remarks made with respect to the computation of 
single integrals (see Sec. 16.10) remain valid, with appropriate 
modifications for cubature formulas of type (8). 
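
The reduction (5) to (8) can be sketched in code by nesting one quadrature rule inside another. The example below is an illustration under stated assumptions: the composite Simpson rule is used for both the inner and the outer integral (the text leaves the choice of quadrature formula open), and the names `simpson` and `double_integral` are hypothetical.

```python
def simpson(g, lo, hi, n=20):
    """Composite Simpson rule for the integral of g over [lo, hi]; n must be even."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(lo + i * h)
    return total * h / 3


def double_integral(f, a, b, phi, psi, n=20):
    """Integrate f over the domain a <= x <= b, phi(x) <= y <= psi(x)."""
    def F(x):
        # Inner integral (5): a cross-section of the volume at fixed x.
        return simpson(lambda y: f(x, y), phi(x), psi(x), n)
    # Outer integral (6).
    return simpson(F, a, b, n)


# Integral of x*y over the triangle 0 <= x <= 1, 0 <= y <= x; exact value 1/8.
approx = double_integral(lambda x, y: x * y, 0.0, 1.0,
                         lambda x: 0.0, lambda x: x)
```

In this test case the inner integrand is linear in $y$ and the resulting $F(x) = x^3/2$ is cubic, so Simpson's rule reproduces the exact value up to rounding; in general the result carries the errors of both quadratures.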




*16.18 A CUBATURE FORMULA OF SIMPSON TYPE

To begin with, let the domain of integration be a rectangle:

$$R\,\{a \le x \le A,\ b \le y \le B\}$$

(Fig. 81), whose sides are parallel to the coordinate axes. Halve
each of the intervals $[a, A]$ and $[b, B]$ by the points

$$x_0 = a, \quad x_1 = a + h, \quad x_2 = a + 2h = A$$

and, respectively,

$$y_0 = b, \quad y_1 = b + k, \quad y_2 = b + 2k = B$$

where

$$h = \frac{A - a}{2}, \quad k = \frac{B - b}{2}$$

Fig. 81

In all, we thus obtain nine points $(x_i, y_j)$ ($i, j = 0, 1, 2$). We have


$$\iint_{(R)} f(x, y)\,dx\,dy = \int_a^A dx \int_b^B f(x, y)\,dy \qquad (1)$$

Now, computing the inner integral by Simpson's quadrature formula,
we find

$$\iint_{(R)} f(x, y)\,dx\,dy = \int_a^A dx \cdot \frac{k}{3}\,[f(x, y_0) + 4f(x, y_1) + f(x, y_2)] = \frac{k}{3}\left[\int_a^A f(x, y_0)\,dx + 4\int_a^A f(x, y_1)\,dx + \int_a^A f(x, y_2)\,dx\right]$$

Again applying Simpson's formula to each integral, we get

$$\iint_{(R)} f(x, y)\,dx\,dy = \frac{hk}{9}\,\{[f(x_0, y_0) + 4f(x_1, y_0) + f(x_2, y_0)] + 4\,[f(x_0, y_1) + 4f(x_1, y_1) + f(x_2, y_1)] + [f(x_0, y_2) + 4f(x_1, y_2) + f(x_2, y_2)]\}$$

or

$$\iint_{(R)} f(x, y)\,dx\,dy = \frac{hk}{9}\,\{[f(x_0, y_0) + f(x_2, y_0) + f(x_0, y_2) + f(x_2, y_2)] + 4\,[f(x_1, y_0) + f(x_0, y_1) + f(x_2, y_1) + f(x_1, y_2)] + 16 f(x_1, y_1)\} \qquad (2)$$

We will call (2) Simpson's cubature formula. Hence,

$$\iint_{(R)} f(x, y)\,dx\,dy = \frac{hk}{9}\,(\sigma_0 + 4\sigma_1 + 16\sigma_2) \qquad (2')$$

where $\sigma_0$ is the sum of the values of the integrand $f(x, y)$ at the
vertices of the rectangle $R$, $\sigma_1$ is the sum of values of $f(x, y)$ at
the midpoints of the sides of the rectangle $R$, and $\sigma_2 = f(x_1, y_1)$ is the
value of the function $f(x, y)$ in the centre of $R$. The multiplicities
of these values are given in Fig. 81.


Example 1. Applying the Simpson cubature formula, evaluate the
double integral [7]

$$I = \int_4^{4.4} \int_2^{2.6} \frac{dx\,dy}{xy}$$

Solution. We take

$$h = \frac{4.4 - 4}{2} = 0.2 \quad \text{and} \quad k = \frac{2.6 - 2}{2} = 0.3$$

The corresponding values of the integrand $z = \frac{1}{xy}$ are listed in
Table 75.

TABLE 75
COMPUTING A DOUBLE INTEGRAL BY SIMPSON'S FORMULA

   y \ x       4.0          4.2          4.4
   2.0       0.125000     0.119048     0.113636
   2.3       0.108696     0.103520     0.0988142
   2.6       0.096154     0.0915751    0.0874126

Applying the cubature formula (2), we obtain

$$I = \frac{0.2 \cdot 0.3}{9}\,[(0.125000 + 0.113636 + 0.096154 + 0.0874126) + 4\,(0.119048 + 0.108696 + 0.0988142 + 0.0915751) + 16 \cdot 0.103520] = 0.0250070$$

The exact value of this double integral is

$$\int_4^{4.4} \int_2^{2.6} \frac{dx\,dy}{xy} = \ln 1.1 \cdot \ln 1.3 = 0.0953108 \cdot 0.262364 = 0.0250061$$

Hence, the residual error is

$$\Delta = |0.0250061 - 0.0250070| = 0.0000009 \approx 10^{-6}$$
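
The computation of Example 1 can be checked with a few lines of code. This is a minimal sketch of formula (2) for a single rectangle; the function name `simpson_cubature` is hypothetical, and the exact value is evaluated with the standard library logarithm rather than taken from tables.

```python
import math


def simpson_cubature(f, a, A, b, B):
    """Simpson's cubature formula (2) on the rectangle a <= x <= A, b <= y <= B."""
    h = (A - a) / 2
    k = (B - b) / 2
    xs = [a, a + h, A]
    ys = [b, b + k, B]
    # Sum of values at the four vertices.
    corners = (f(xs[0], ys[0]) + f(xs[2], ys[0])
               + f(xs[0], ys[2]) + f(xs[2], ys[2]))
    # Sum of values at the midpoints of the four sides.
    mids = (f(xs[1], ys[0]) + f(xs[0], ys[1])
            + f(xs[2], ys[1]) + f(xs[1], ys[2]))
    centre = f(xs[1], ys[1])
    return h * k / 9 * (corners + 4 * mids + 16 * centre)


approx = simpson_cubature(lambda x, y: 1 / (x * y), 4.0, 4.4, 2.0, 2.6)
exact = math.log(1.1) * math.log(1.3)
error = abs(approx - exact)
```

The value of `approx` agrees with the hand computation to the seven decimal places shown, and `error` is of order $10^{-6}$.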




If the dimensions of the rectangle $R\,\{a \le x \le A,\ b \le y \le B\}$
are great, increased accuracy of the cubature formula (2) is obtained
by partitioning the domain $R$ into a system of rectangles and
applying Simpson's cubature formula to each.

Suppose that the sides of the rectangle $R$ are divided into $n$ and
$m$ equal parts, respectively; the result will be a relatively coarse
net of $nm$ rectangles (in Fig. 82, the vertices of these rectangles
are indicated by the large circles). We then subdivide each of these
rectangles into four equal parts. The vertices of this fine net of
rectangles will represent the lattice points $M_{ij}$ of the cubature
formula.

Fig. 82





Let

$$h = \frac{A - a}{2n}$$

and

$$k = \frac{B - b}{2m}$$

Then the lattice of points will have the following coordinates:

$$x_i = x_0 + ih \quad (x_0 = a;\ i = 0, 1, 2, \ldots, 2n)$$

and

$$y_j = y_0 + jk \quad (y_0 = b;\ j = 0, 1, 2, \ldots, 2m)$$

For the sake of brevity, we introduce the notation

$$f(x_i, y_j) = f_{ij}$$

Applying formula (2) to each of the rectangles of the coarse net,
we have (Fig. 82)

$$\iint_{(R)} f(x, y)\,dx\,dy = \frac{hk}{9} \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} \left[(f_{2i,\,2j} + f_{2i+2,\,2j} + f_{2i+2,\,2j+2} + f_{2i,\,2j+2}) + 4\,(f_{2i+1,\,2j} + f_{2i+2,\,2j+1} + f_{2i+1,\,2j+2} + f_{2i,\,2j+1}) + 16 f_{2i+1,\,2j+1}\right]$$



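Whence, collecting like terms, we finally get

$$\iint_{(R)} f(x, y)\,dx\,dy = \frac{hk}{9} \sum_{i=0}^{2n} \sum_{j=0}^{2m} \lambda_{ij} f_{ij} \qquad (3)$$

where the coefficients $\lambda_{ij}$ are the corresponding elements of the
matrix

$$\begin{pmatrix}
1 & 4 & 2 & 4 & 2 & \ldots & 4 & 2 & 4 & 1 \\
4 & 16 & 8 & 16 & 8 & \ldots & 16 & 8 & 16 & 4 \\
2 & 8 & 4 & 8 & 4 & \ldots & 8 & 4 & 8 & 2 \\
\ldots & \ldots & \ldots & \ldots & \ldots & \ldots & \ldots & \ldots & \ldots & \ldots \\
4 & 16 & 8 & 16 & 8 & \ldots & 16 & 8 & 16 & 4 \\
1 & 4 & 2 & 4 & 2 & \ldots & 4 & 2 & 4 & 1
\end{pmatrix}$$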
If the domain of integration $\sigma$ is curvilinear, then we construct a
rectangle $R \supset \sigma$ whose sides are parallel to the coordinate axes
(Fig. 83). We consider the auxiliary function

$$f^*(x, y) = \begin{cases} f(x, y) & \text{if } (x, y) \in \sigma, \\ 0 & \text{if } (x, y) \in R - \sigma \end{cases}$$

Fig. 83

We then clearly have

$$\iint_{(\sigma)} f(x, y)\,dx\,dy = \iint_{(R)} f^*(x, y)\,dx\,dy$$

This integral can be approximately evaluated by the general
cubature formula (3).




REFERENCES FOR CHAPTER 16 


[1] Sh. E. Mikeladze, Numerical Methods of Mathematical Analysis, 1953,
Chapters XIII, XVII (in Russian).

[2] E. Milne, Numerical Calculus, 1949, Chapter IV.

[3] S. M. Nikolsky, Quadrature Formulas, 1958 (in Russian).

[4] A. Markov, Calculus of Finite Differences, 1911, Chapter V (in Russian).

[5] J. F. Steffensen, Interpolation, 1927.

[6] I. S. Berezin and N. P. Zhidkov, Computational Methods, 1959, Volume 1,
Chapter III (in Russian).

[7] J. B. Scarborough, Numerical Mathematical Analysis, 1955, Chapter VII.

[8] A. N. Krylov, Lectures on Approximate Computations, 1933, Chapter III
(in Russian).

[9] V. I. Krylov, Approximate Calculation of Integrals, 1959 (in Russian).

[10] M. G. Salvadori and M. L. Baron, Numerical Methods in Engineering, 1952.

[11] G. M. Fikhtengolts, Course of Differential and Integral Calculus, 1948,
Volume 2, Chapters IX, XIII (in Russian).


Chapter 17 
THE MONTE CARLO METHOD 


17.1 THE IDEA OF THE MONTE CARLO METHOD 


The usual way of solving a problem is to indicate an algorithm
(a sequence of operations) which yields the desired quantity $f$
either exactly or to a specified accuracy. Namely, if we denote
by $f_1, f_2, \ldots, f_n, \ldots$ the corresponding results of sequentially
accumulating operations, then

$$f = \lim_{n \to \infty} f_n \qquad (1)$$

and in the case of a finite number of operations, the process is
terminated at some stage. Here the computation procedure is
strictly deterministic: in the absence of mistakes, two different
computors obtain the same result.

There are problems, however, for which it is practically impossible
to construct an algorithm or the algorithm is prohibitively
complicated. In such cases, one often resorts to modelling the
mathematical or physical essence of the problem and to using the
law of large numbers from probability theory. The estimates
$f_1, f_2, \ldots, f_n, \ldots$ of the desired quantity $f$ are obtained via a
statistical treatment of material involving the results of certain
random trials repeated many times. It is then required that the
random quantity $f_n$ converge in probability to the required quantity
$f$ as $n \to \infty$ [1], [2]; that is, for any $\varepsilon > 0$ we must have the
limiting relation

$$\lim_{n \to \infty} P(|f - f_n| < \varepsilon) = 1 \qquad (2)$$

where $P$ denotes the appropriate probability.

The choice of the quantity f, is determined by the specific pe- 
culiarities of the problem. For example, one often interprets the 
sought-for quantity f as the probability of a random event (or, 
more generally, as the mathematical expectation of some random 
quantity). Then the frequency f, of occurrence of the event for n 
random trials (or, respectively, the empirical mean of the values 




of a random quantity) may, under broad assumptions, be regarded 
as a probability estimate of the sought-for quantity. Other ver- 
sions are also possible. It will be noted that in these cases the 
computational procedure is nondeterministic since it is determined 
by the results of random trials. 
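
The idea of using a frequency as a probabilistic estimate can be illustrated by a small experiment. In the sketch below, which is an illustration rather than anything taken from the text, the sought-for quantity is the probability $\pi/4$ that a point thrown at random into the unit square falls inside the quarter disk $x^2 + y^2 < 1$; the function names and the fixed seed are assumptions made so that the run is reproducible.

```python
import math
import random


def frequency(event, n, rng):
    """Frequency f_n of occurrence of a random event in n independent trials."""
    hits = sum(1 for _ in range(n) if event(rng))
    return hits / n


def in_quarter_disk(rng):
    """One random trial: does a uniform point of the unit square hit the disk?"""
    x, y = rng.random(), rng.random()
    return x * x + y * y < 1.0


rng = random.Random(12345)
estimates = {n: frequency(in_quarter_disk, n, rng) for n in (100, 10_000, 200_000)}
target = math.pi / 4
```

The estimates fluctuate from run to run, and only their convergence in probability to `target` is guaranteed; the deviation typically decreases like $1/\sqrt{n}$.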

Methods of solving problems that employ random quantities are 
classed under the general heading of the Monte Carlo method. To 
be more precise, the Monte Carlo method [3], [4], [5], [6] embra-
ces a collection of techniques which permit obtaining the solutions
of mathematical or physical problems by means of repeated random 
trials. The estimates of a sought-for quantity are derived statisti- 
cally and are of a probabilistic nature. In actual practice, random 
trials are replaced by the results of certain computations performed 
on random numbers (see Sec. 17.2). 

The Monte Carlo approach came into effective use with the 
advent of high-speed electronic computers, since to obtain a suf- 
ficiently exact estimate of the desired quantity requires performing 
computations for a very large number of special cases with subse- 
quent statistical treatment of a stupendous amount of numerical 
data. It must be pointed out that when using the Monte Carlo
method there is no need to know the exact relationships between 
the given and the sought-for quantities of the problem; it suffices 
only to elucidate the set of conditions under which an appropriate 
event occurs. This circumstance makes it possible to use the Monte 
Carlo method in solving logical problems. 

Of the mathematical problems to which the Monte Carlo method 
has been applied, we mention the following: solving systems of 
linear equations, matrix inversion, finding the eigenvalues and 
eigenvectors of a matrix, evaluating multiple integrals, solving 
the Dirichlet problem, solving functional equations of a variety
of types, and so on. The Monte Carlo method is also successfully
employed in the solution of problems of nuclear physics. It is
well to bear in mind that even in the solution of one and the
same problem the scheme for applying the method may be essen-
tially different.

In this chapter we consider the computation of multiple integ- 
rals and the solution of systems of linear equations by the Monte 
Carlo method. Information concerning other types of problems may 
be found in the special literature (see, for example, [3], biblio- 
graphy, and also [6]). 


17.2 RANDOM NUMBERS 


In actual applications of the Monte Carlo method, random trials 
are ordinarily replaced by sampling random numbers. 




Definition 1. A random quantity is one whose values depend on
the outcome of a random event.
A random quantity $X$ is given by the distribution law

$$P(X < x) = \Phi(x)$$

where $x$ is any real number and $\Phi(x)$ is a known function (the
distribution function). The values of a random quantity are called
random numbers.

Definition 2. If a random quantity has a given distribution law
[1], [2] (uniform, normal, and so forth), then we will say that
the appropriate random numbers are distributed by this law.

Suppose the numbers $x_1, x_2, \ldots, x_n, \ldots$ are the values of one
and the same random quantity $X$ under independent trials with
recurrent conditions. Then the sequence of random numbers

$$\{x_n\} \qquad (1)$$

is called a random sequence with the appropriate distribution law.
In what follows we will, as a rule, consider random sequences (1)
uniformly distributed on the unit interval $0 \le x \le 1$. If $(a, b)$ is
any interval¹⁾ extracted from $[0, 1]$ and $\nu_n = \nu_n(a, b)$ is the number
of elements in a finite subsequence $x_1, x_2, \ldots, x_n$ belonging to
$(a, b)$, then for the uniformly distributed sequence (1) we have
the limiting relation

$$\lim_{n \to \infty} \frac{\nu_n(a, b)}{n} = b - a \qquad (2)$$

which means that the limiting relative frequency of the sequence
$\{x_n\}$ uniformly distributed on $[0, 1]$ is, for each subinterval $(a, b)$,
equal to the length of the interval with probability 1.

If the random sequence $\{x_n\}$ is uniformly distributed on the
interval $[0, 1]$, then the linear transformation

$$y_n = A + (B - A)\,x_n \quad (n = 1, 2, \ldots) \qquad (3)$$

where $A$ and $B$ are given numbers, reduces to the random sequence
$\{y_n\}$ uniformly distributed on the interval $[A, B]$. Generally,
having a random sequence $\{x_n\}$ uniformly distributed on the
interval $[0, 1]$, we can construct a random sequence $\{y_n\}$ with a
specified distribution law $\Phi(y)$. Namely, suppose

$$\Phi(y) = \int_{-\infty}^{y} \varphi(t)\,dt$$


1) By agreement, the endpoints a and b may or may not be included in the 
interval (a, b). 




is the appropriate distribution function,¹⁾ where $\varphi(t)$ is the
probability density.
For the sake of simplicity, we assume that the function

$$x = \Phi(y)$$

is continuous and strictly monotonic (Fig. 84). Then, determining
$y_n$ from the equation

$$x_n = \Phi(y_n) \quad (n = 1, 2, \ldots)$$

we obtain for each $x_n$ the random sequence $\{y_n\}$ having the
specified distribution law $\Phi(y)$. Namely, by the mode of construction
the following limiting relation will hold, with probability 1, for
the sequence $\{y_n\}$:

$$\lim_{n \to \infty} \frac{\nu_n(a, b)}{n} = \int_a^b \varphi(y)\,dy \qquad (4)$$

where $\nu_n(a, b)$ is the number of elements of the finite sequence
$y_1, \ldots, y_n$ belonging to the arbitrary interval $(a, b)$.

Fig. 84

In particular, putting

$$\varphi(y) = \frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}$$

we obtain by this method a canonical normally distributed (Gaussian)
random sequence $\{y_n\}$ which corresponds to a random quantity
$Y$ with mathematical expectation $MY = 0$ and variance $DY = 1$.
The linear transformation

$$z_n = \sigma y_n + c \quad (n = 1, 2, \ldots)$$

yields a normally distributed random sequence $\{z_n\}$, which
corresponds to the random quantity $Z$ for which the expectation $MZ = c$
and the variance $DZ = \sigma^2$.

1) If $y_n$ ($n = 1, 2, \ldots$) lie in the finite interval $A \le y \le B$, then we as usual
assume $\varphi(y) = 0$ for $y \notin [A, B]$.
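
The construction $x_n = \Phi(y_n)$ followed by $z_n = \sigma y_n + c$ can be sketched with the Python standard library, whose `statistics.NormalDist.inv_cdf` solves $x = \Phi(y)$ for the standard normal law. The function name `normal_sequence`, the parameter values and the seed below are assumptions chosen for illustration.

```python
import random
from statistics import NormalDist, fmean, pvariance


def normal_sequence(n, sigma, c, rng):
    """Build z_n = sigma * y_n + c, where y_n solves x_n = Phi(y_n)."""
    standard = NormalDist(0.0, 1.0)
    zs = []
    while len(zs) < n:
        x = rng.random()            # uniform on [0, 1)
        if x == 0.0:
            continue                # Phi(y) = 0 has no finite solution
        y = standard.inv_cdf(x)     # canonical normal value
        zs.append(sigma * y + c)
    return zs


zs = normal_sequence(50_000, sigma=2.0, c=1.0, rng=random.Random(2024))
mean, var = fmean(zs), pvariance(zs)
```

For a sample of this size the empirical mean and variance should lie close to $c = 1$ and $\sigma^2 = 4$, in agreement with $MZ = c$ and $DZ = \sigma^2$.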


17.3 WAYS OF GENERATING RANDOM NUMBERS


Random numbers may be generated by using the results of 
random physical processes (say, dice, roulette wheels, scintillations
in a Geiger-Müller counter, noise generated in electrical transmis- 
sion systems, etc.). There are also available tables of random 
numbers (see, for example, [7], [8]). 

Strictly speaking, when we use mechanical devices to obtain 
random numbers we cannot be completely sure that we are dealing
with random events with a specified distribution of probabilities. 
Such material is therefore tested statistically for randomness. In 
this sense, tables of random numbers are a more reliable source, 
since they have already been tested for randomness. However, the 
use of tables of random numbers for solving problems on digital 
computers often involves great inconvenience [9]. 

Monte Carlo solutions ordinarily require a very large number 
of random numbers. For practical purposes, these numbers are 
most conveniently obtained by means of special randomizing devi- 
ces which are hooked up to the computer. The operation of these 
devices is regulated by random physical processes (such as ra- 
dioactive decay, noise in electronic tubes, etc.) [9]. 

Since the generation of random numbers that meet a given 
theoretical model is an extremely delicate and involved process,
one often confines himself to obtaining so-called pseudorandom 
numbers, which largely resemble the corresponding random num- 
bers. Rather involved mathematical algorithms are required for 
the production of pseudorandom numbers. In the sequel, the term 
“random number” will include both random and pseudorandom 
numbers if the distinction between them is not essential. 

We now give some simple techniques for obtaining random
numbers (in the broad meaning) uniformly distributed on the
interval $[0, 1]$. For the sake of simplicity, we assume that these
numbers are pure decimal fractions with a fixed number of decimals,
say $s$ ($s$-digit decimal fractions); that is, such that can be
written as

$$\xi = \sum_{i=1}^{s} \frac{\alpha_i}{10^i} \qquad (1)$$

where $\alpha_i$ ($i = 1, 2, \ldots, s$) are the digits of this number, which
take the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.

In order to compile a table of random numbers of type (1)
uniformly distributed on $[0, 1]$, it is sufficient to indicate the
methods of obtaining the digits $\alpha_i$ and to meet the following
conditions:

(a) $\alpha_i$ is a random sampling taken from the set of numbers from
0 to 9, all indicated values being equally probable and mutually
independent;

(b) the choice of a digit $\alpha_i$ does not in the least affect the choice
of the next digit $\alpha_{i+1}$.

A sampling of this type is performed $s$ times in order to obtain
an $s$-digit random number.

The mode of choice which meets conditions (a) and (b) can 
actually be realized in many ways. We consider several. 

1. Place in an urn ten identical balls numbered from 0 to 9.
The balls are drawn from the urn in succession and the numbers
$\alpha$ are recorded. After each extraction, the ball is returned to the
urn and all the balls in the urn are mixed before each new
extraction.

2. Two dice are thrown at one time. If $n_1$ and $n_2$ are the numbers
obtained ($n_1, n_2 = 1, 2, 3, 4, 5, 6$) on the first and second die,
respectively (the dice must be distinguishable), then the digit $\alpha$
of the random number is taken equal to the remainder left after
dividing the sum $6(n_1 - 1) + n_2$ by 10, where $n_1 < 6$; that is,
$\alpha$ is a nonnegative integer less than 10 that satisfies the
congruence¹⁾

$$6(n_1 - 1) + n_2 \equiv \alpha \pmod{10} \qquad (2)$$

If $n_1 = 6$, the dice are thrown again. From formula (2) it follows
that the digit $\alpha$ can with equal probability assume any value
from 0 to 9 (see [7]).
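
The claim that rule (2) yields equiprobable digits can be verified by direct enumeration: the 30 admissible outcomes with $n_1 < 6$ give the values $6(n_1 - 1) + n_2 = 1, 2, \ldots, 30$, each exactly once, so every remainder modulo 10 occurs three times. The sketch below performs this check and also simulates the procedure; the function name `dice_digit` and the seed are assumptions.

```python
import random
from collections import Counter


def dice_digit(rng):
    """Draw one decimal digit by rule (2), rethrowing whenever n1 = 6."""
    while True:
        n1, n2 = rng.randint(1, 6), rng.randint(1, 6)
        if n1 < 6:
            return (6 * (n1 - 1) + n2) % 10


# Exhaustive count over the 30 admissible outcomes (n1 = 1..5, n2 = 1..6).
counts = Counter((6 * (n1 - 1) + n2) % 10
                 for n1 in range(1, 6) for n2 in range(1, 7))

rng = random.Random(7)
digits = [dice_digit(rng) for _ in range(5)]
# A 5-digit random number of type (1) assembled from the digits.
number = sum(d / 10 ** (i + 1) for i, d in enumerate(digits))
```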

3. An $s$-digit integer is squared and the middle $s$ digits are
extracted; then the process is repeated. If $s$ is sufficiently great,
say $s \approx 10$, then the extracted digits may be taken as sets of
elements of $s$-digit pseudorandom numbers [3].

To obtain a sequence of pseudorandom numbers, one can also 
multiply a multidigit number by a fixed multiplier and extract 
middle digits or square a multidigit number and reduce modulo 
some sufficiently large prime. 
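
The squaring procedure of item 3 (often called the middle-square method) is easy to sketch. In the example below the square is padded with leading zeros to $2s$ digits before the middle $s$ digits are taken; this padding convention, the even value of $s$, the function name and the small illustrative seed are all assumptions.

```python
def middle_square(seed, s, count):
    """Generate `count` s-digit integers by repeated squaring (s assumed even)."""
    values = []
    u = seed
    for _ in range(count):
        square = str(u * u).zfill(2 * s)   # pad the square to 2s digits
        start = s // 2
        u = int(square[start:start + s])   # keep the middle s digits
        values.append(u)
    return values


seq = middle_square(1234, 4, 3)
# Corresponding pseudorandom fractions on [0, 1).
fractions_ = [u / 10 ** 4 for u in seq]
```

Starting from 1234, the squares are 01522756, 27321529 and 10336225, so the sequence begins 5227, 3215, 3362. With so small an $s$ the sequence soon degenerates into a short cycle, which is why a sufficiently large $s$ is recommended.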

4. A pseudorandom sequence {x,} is generated by means of the 
process [10] 


X, = 27 Pu, 
where 
Uy = l y Unad == ou, (mod Ze) 


1) The notation $a \equiv b \pmod{k}$ ($a$, $b$, $k$ integers) states that the difference
$a - b$ is exactly divisible by $k$.




5. Use is made of the decimal expansion of a positive irrational
number

$$\omega = B_0.\beta_1\beta_2 \ldots \beta_n \ldots = B_0 + (\omega)$$

where $B_0$ is the integral part of the number $\omega$ and $(\omega)$ is its
fractional part.
To obtain a random sequence $\{x_n\}$ we assume

$$x_n = (n\omega) \quad (n = 1, 2, \ldots)$$

If a random sequence consisting of $s$-digit numbers is required,
then one confines himself to the appropriate digits in the numbers
$(n\omega)$.

To solve certain problems, one needs several random sequences:

$$\{x_n^{(1)}\},\ \{x_n^{(2)}\},\ \ldots,\ \{x_n^{(m)}\}$$

In this case, one chooses $m$ linearly independent positive
irrationalities $\omega_1, \omega_2, \ldots, \omega_m$ and puts

$$x_n^{(k)} = (n\omega_k) \quad (k = 1, 2, \ldots, m;\ n = 1, 2, \ldots)$$

It is also possible to take one uniformly distributed random
sequence $\{x_n\}$ and perform $m$ samplings from it:

$$\{x_1, x_{m+1}, x_{2m+1}, \ldots\},$$
$$\{x_2, x_{m+2}, x_{2m+2}, \ldots\},$$
$$\ldots$$
$$\{x_m, x_{2m}, x_{3m}, \ldots\}$$

taking numbers after every $m$ elements instead of in succession.
We will then obviously have $m$ uniformly distributed subsequences.
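
Both devices of item 5 can be sketched in a few lines. The example takes $\omega = \sqrt{2}$ in floating-point arithmetic, which is an assumption made for illustration (a float only approximates an irrational number, so the sketch is suitable for short sequences only); the function names are hypothetical.

```python
import math


def fractional_sequence(omega, count):
    """Return x_n = (n * omega), the fractional parts, for n = 1, ..., count."""
    return [math.fmod(n * omega, 1.0) for n in range(1, count + 1)]


def split_interleaved(xs, m):
    """Split xs into m subsequences by taking every m-th element."""
    return [xs[k::m] for k in range(m)]


xs = fractional_sequence(math.sqrt(2.0), 3000)
subsequences = split_interleaved(xs, 3)
# Relative frequency of hits in (0.2, 0.5); relation (2) suggests about 0.3.
in_interval = sum(1 for x in xs if 0.2 < x < 0.5) / len(xs)
```

Here `subsequences[0]` is $\{x_1, x_4, x_7, \ldots\}$, `subsequences[1]` is $\{x_2, x_5, \ldots\}$, and so on.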


TABLE 76
RANDOM NUMBERS UNIFORMLY DISTRIBUTED ON [0, 1]

0.57705   0.35483   0.11578    0.65339   0.66674
0.71618   0.09393   0.93045    0.93382   0.99279
0.73710   0.30304   0.93011    0.05758   0.24202
0.70131   0.55186   0.42844    0.00336   0.94010
0.16961   0.64003   0.52906    0.88222   0.60981

0.53324   0.20514   0.09461    0.98585   0.13094
0.43166   0.00188   0.99602    0.52103   0.35193
0.26275   0.55709   0.69962    0.91827   0.64560
0.05926   0.86977   0.313114   0.07069   0.64559
0.66289   0.31303   0.27004    0.13928   0.68008







Tables of random numbers are compiled by these and other 
methods. These tables ordinarily give random decimal digits, from 
which it is easy to construct random numbers of a specific size. By way 
of an example, we give a portion of a table (see [7]) of five- 
digit random numbers (Table 76).


17.4 MONTE CARLO EVALUATION OF MULTIPLE INTEGRALS 


Let the function

$$y = f(x_1, x_2, \ldots, x_m)$$

be continuous in a closed bounded domain $S$ and let it be required
to evaluate the $m$-fold integral

$$I = \int \ldots \int_{(S)} f(x_1, x_2, \ldots, x_m)\,dx_1\,dx_2 \ldots dx_m \qquad (1)$$

Geometrically, the number $I$ is an $(m+1)$-dimensional volume of
a right cylindroid¹⁾ in the space $Ox_1x_2 \ldots x_m y$ constructed on
the base $S$ and bounded above by the given surface $y = f(x)$,
where $x = (x_1, x_2, \ldots, x_m)$ (Fig. 85).

Fig. 85

We transform the integral (1) so that the new region of integration
lies entirely within the unit $m$-dimensional hypercube. Let
the domain $S$ be located in the $m$-dimensional parallelepiped

$$a_i \le x_i \le A_i \quad (i = 1, 2, \ldots, m) \qquad (2)$$

We make the change of variables

$$x_i = a_i + (A_i - a_i)\,\xi_i \quad (i = 1, 2, \ldots, m) \qquad (3)$$

It is then clear that the $m$-dimensional parallelepiped (2) will be
transformed into the unit $m$-dimensional hypercube

$$0 \le \xi_i \le 1 \quad (i = 1, 2, \ldots, m) \qquad (4)$$

and, consequently, the new region of integration $\sigma$, which is found
by the ordinary rules, will lie entirely inside this hypercube
(Fig. 86).

1) More precisely, an algebraic volume in which it is assumed that the parts
of the cylindroid located above the hyperplane $Ox_1x_2 \ldots x_m$ are of positive
measure, those below that hyperplane, of negative measure.

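Computing the Jacobian of the transformation, we have

$$\frac{D(x_1, x_2, \ldots, x_m)}{D(\xi_1, \xi_2, \ldots, \xi_m)} = \begin{vmatrix} A_1 - a_1 & 0 & \ldots & 0 \\ 0 & A_2 - a_2 & \ldots & 0 \\ \ldots & \ldots & \ldots & \ldots \\ 0 & 0 & \ldots & A_m - a_m \end{vmatrix} = (A_1 - a_1)(A_2 - a_2) \ldots (A_m - a_m)$$

Thus

$$I = \int \ldots \int_{(\sigma)} F(\xi_1, \xi_2, \ldots, \xi_m)\,d\xi_1\,d\xi_2 \ldots d\xi_m \qquad (5)$$

where

$$F(\xi_1, \xi_2, \ldots, \xi_m) = (A_1 - a_1)(A_2 - a_2) \ldots (A_m - a_m) \times f(a_1 + (A_1 - a_1)\xi_1,\ a_2 + (A_2 - a_2)\xi_2,\ \ldots,\ a_m + (A_m - a_m)\xi_m)$$

Introducing the notation

$$\xi = (\xi_1, \xi_2, \ldots, \xi_m)$$

and

$$d\sigma = d\xi_1\,d\xi_2 \ldots d\xi_m$$

we can write (5) more compactly as

$$I = \int \ldots \int_{(\sigma)} F(\xi)\,d\sigma \qquad (5')$$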
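
The passage from $f$ to $F$ by the change of variables (3) can be written as a small helper. This is a sketch only: the name `to_unit_cube` and the sample integrand and parallelepiped are assumptions introduced for illustration.

```python
from math import prod


def to_unit_cube(f, lower, upper):
    """Return F(xi) = J * f(a_1 + (A_1 - a_1) xi_1, ...), J being the Jacobian."""
    widths = [A - a for a, A in zip(lower, upper)]
    jacobian = prod(widths)            # (A_1 - a_1)(A_2 - a_2)...(A_m - a_m)

    def F(xi):
        x = [a + w * t for a, w, t in zip(lower, widths, xi)]
        return jacobian * f(*x)

    return F


# f(x1, x2) = x1 + x2 on the parallelepiped 1 <= x1 <= 3, 0 <= x2 <= 2.
F = to_unit_cube(lambda x1, x2: x1 + x2, lower=[1.0, 0.0], upper=[3.0, 2.0])
value = F([0.5, 0.5])      # Jacobian 4 times f(2, 1) = 3
```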


We will give two ways for evaluating integral (5') by the method
of random trials.

First method. Choose on the interval $[0, 1]$ $m$ uniformly distributed
independent sequences of random numbers:

$$\xi_1^{(1)}, \xi_2^{(1)}, \ldots, \xi_i^{(1)}, \ldots$$
$$\xi_1^{(2)}, \xi_2^{(2)}, \ldots, \xi_i^{(2)}, \ldots$$
$$\ldots$$
$$\xi_1^{(m)}, \xi_2^{(m)}, \ldots, \xi_i^{(m)}, \ldots$$

The points $M_i(\xi_i^{(1)}, \xi_i^{(2)}, \ldots, \xi_i^{(m)})$ ($i = 1, 2, \ldots$) may be regarded
as random. Choosing a sufficiently large number $N$ of points $M_1,
M_2, \ldots, M_N$, we check to see which belong to the domain $\sigma$
(first category) and which do not (second category). Let (Fig. 87)

(1) $M_i \in \sigma$ for $i = 1, 2, \ldots, n$ (6)

and

(2) $M_i \notin \sigma$ for $i = n+1, n+2, \ldots, N$ (6')

(for the sake of convenience we change the numbering of the
points here). With regard to the boundary $\Gamma$ of the domain $\sigma$,
it is necessary beforehand to specify which boundary points, or
what part of them, are to be regarded as belonging to $\sigma$ and
which as not belonging to this domain. This is of no particular
significance in the general case for a smooth boundary $\Gamma$; but in
certain cases it must be settled with regard for the concrete
situation.

Fig. 86    Fig. 87

Taking a sufficiently large number $n$ of points $M_i \in \sigma$, we put,
approximately,

$$y_{\mathrm{av}} = \frac{1}{n} \sum_{i=1}^{n} F(M_i)$$

whence the desired integral is given by the formula

$$I = y_{\mathrm{av}}\,\sigma = \frac{\sigma}{n} \sum_{i=1}^{n} F(M_i) \qquad (7)$$

where $\sigma$ is understood to mean the $m$-dimensional volume of the
domain of integration $\sigma$. If it is difficult to compute the volume
of $\sigma$, then we can take

$$\sigma \approx \frac{n}{N}$$

whence

$$I \approx \frac{1}{N} \sum_{i=1}^{n} F(M_i)$$

In the particular case where $\sigma$ is the unit cube ($\sigma = 1$), verification
is unnecessary; thus $n = N$ and we simply have

$$I \approx \frac{1}{N} \sum_{i=1}^{N} F(M_i)$$


In verifying the conditions (6) and (6'), one ordinarily proceeds
from the analytic representation of the boundary $\Gamma$ of the domain $\sigma$.
In the simplest case, if the surface $\Gamma$ is given by the equation

$$\varphi(\xi) = 0 \qquad (8)$$

where for $\varphi(\xi) < 0$ the point $\xi \in \sigma$ and for $\varphi(\xi) > 0$ the point
$\xi \notin \sigma$, we have: (1) if $\varphi(M_i) < 0$, then the point $M_i$ is of the first
category and (2) if $\varphi(M_i) > 0$, then the point $M_i$ is of the second
category. The points $M_i$ for which $\varphi(M_i) = 0$ are classed in the
first or second category by convention. Note that equation (8) may
be replaced by any equivalent equation. This sometimes greatly
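simplifies computation. Thus, for instance, the inequality for a
circle

$$x^2 + y^2 - x - y + \tfrac{1}{4} < 0$$

is more conveniently replaced by the equivalent inequality

$$\left(x - \tfrac{1}{2}\right)^2 + \left(y - \tfrac{1}{2}\right)^2 < \tfrac{1}{4}$$

since the latter one is more easily verified.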
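If the domain $\sigma$ is standard and given by the inequalities

$$\underline{\xi}_1 \le \xi_1 \le \overline{\xi}_1,$$
$$\underline{\xi}_2(\xi_1) \le \xi_2 \le \overline{\xi}_2(\xi_1), \qquad (9)$$
$$\ldots$$
$$\underline{\xi}_m(\xi_1, \ldots, \xi_{m-1}) \le \xi_m \le \overline{\xi}_m(\xi_1, \ldots, \xi_{m-1})$$

then to determine the membership of a random point $M(\xi_1,
\ldots, \xi_m)$ in the first or second category, one verifies the validity
of these inequalities.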


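TABLE 77
SCHEME FOR DETERMINING MEMBERSHIP OF A RANDOM POINT
$M(\xi_1, \ldots, \xi_m)$ IN THE STANDARD DOMAIN (9)

This is conveniently done by the scheme given in Table 77. Here

$$\varepsilon_i = \begin{cases} 1 & \text{if } \xi_i \in [\underline{\xi}_i, \overline{\xi}_i], \\ 0 & \text{if } \xi_i \notin [\underline{\xi}_i, \overline{\xi}_i] \end{cases}$$

($i = 1, 2, \ldots, m$) and $\varepsilon = \varepsilon_1 \varepsilon_2 \ldots \varepsilon_m$. Clearly,

if $\varepsilon = 1$, then $M \in \sigma$,

if $\varepsilon = 0$, then $M \notin \sigma$.

Note that if $\varepsilon_j = 0$ ($j < m$), then the subsequent values $\varepsilon_{j+1}, \ldots,
\varepsilon_m$ need not be computed since they do not affect the final result.
The value of the function $y = F(M)$ is computed only for those
points $M$ for which $\varepsilon = 1$. Then formula (7) is used to
evaluate the integral $I$.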


Example. Give an approximate evaluation, by the Monte Carlo method,
of the integral

$$I = \iint_{(\sigma)} (x^2 + y^2)\,dx\,dy \qquad (10)$$

where the domain $\sigma$ of integration is defined by the inequalities

$$\tfrac{1}{2} \le x \le 1, \quad 0 \le y \le 2x - 1$$

(Fig. 88).


Solution. The integral (10) is given in reduced form, that is, the
domain $\sigma$ of integration is located in the unit square

$$0 \le x \le 1, \quad 0 \le y \le 1$$

To solve this problem, we take advantage of the table of random
numbers (Table 76), regarding each succeeding pair of numbers
in the table as the corresponding coordinates $x$ and $y$ of the random
point $M(x, y)$. Since the computation is of an illustrative nature,
we confine ourselves to $N = 20$ random points; for the sake of
simplicity, we round their coordinates to three decimal places.
The results of the computations are listed in Table 78, where we put

$$\underline{x} = \tfrac{1}{2}, \quad \overline{x} = 1, \quad \underline{y}(x) = 0, \quad \overline{y}(x) = 2x - 1, \quad z = x^2 + y^2$$
TABLE 78
EVALUATING THE DOUBLE INTEGRAL (10)
BY THE MONTE CARLO METHOD

   x     x_low   x_up   e1     y     y_low(x)  y_up(x)   e2   e      z
 0.577   0.500   1.000   1   0.716      0       0.154     0   0
 0.737   0.500   1.000   1   0.701      0       0.474     0   0
 0.170   0.500   1.000   0   0.533                            0
 0.432   0.500   1.000   0   0.263                            0
 0.059   0.500   1.000   0   0.663                            0
 0.355   0.500   1.000   0   0.094                            0
 0.303   0.500   1.000   0   0.552                            0
 0.640   0.500   1.000   1   0.205      0       0.280     1   1    0.452
 0.002   0.500   1.000   0   0.557                            0
 0.870   0.500   1.000   1   0.313      0       0.740     1   1    0.855
 0.116   0.500   1.000   0   0.930                            0
 0.930   0.500   1.000   1   0.428      0       0.860     1   1    1.048
 0.529   0.500   1.000   1   0.095      0       0.058     0   0
 0.996   0.500   1.000   1   0.700      0       0.992     1   1    1.482
 0.313   0.500   1.000   0   0.270                            0
 0.653   0.500   1.000   1   0.934      0       0.306     0   0
 0.058   0.500   1.000   0   0.003                            0
 0.882   0.500   1.000   1   0.986      0       0.764     0   0
 0.521   0.500   1.000   1   0.918      0       0.042     0   0
 0.071   0.500   1.000   0   0.239                            0
 Sum                                                          4    3.837

(x_low, x_up, y_low, y_up denote the lower and upper bounds; e1, e2, e denote
the indicators $\varepsilon_1$, $\varepsilon_2$, $\varepsilon$.)








002 Ch, 17. The Monte Carlo Method 


‘Thus , 
Zap = +- 3.837 = 0.96 


and, hence, by formula (7),, noting that o=}, we have 


4 x 
I= 2qy:6=0.96-- =0.24 (11) 
li we take, approximately, 
„h 4 1 
DEON TOO, 


then we get 
I œ= 0.96- =0.19 


Note that the true value of the integral is 


7 
and so the relative error of the result (11) is 
0.24— 0.22 
oS SA 


The number of points here, ¥=20, is of course insufficient for 
the statistical regularities to exhibit themselves properly; never- 
theless, a result satisfactory enough for a rough orientation has 
been obtained. 
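
The computation above can be repeated with many more points. The following is a minimal Python sketch, not the book's procedure: it assumes the region and integrand as read from the column heads of Table 78 (1/2 ≤ x ≤ 1, 0 ≤ y ≤ 2x − 1, z = x² + y²), uses Python's pseudorandom generator in place of Table 76, and the names `in_sigma` and `first_method` are illustrative.

```python
import random


def in_sigma(x, y):
    # Region as read from Table 78: 1/2 <= x <= 1, 0 <= y <= 2x - 1 (assumed).
    return 0.5 <= x <= 1.0 and 0.0 <= y <= 2.0 * x - 1.0


def first_method(n_points, seed=0):
    """Return two estimates of I: z_av * sigma and z_av * n / N."""
    rng = random.Random(seed)
    sigma_area = 0.25          # exact area of the triangular region
    z_sum, n_inside = 0.0, 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()   # random point in the unit square
        if in_sigma(x, y):
            n_inside += 1
            z_sum += x * x + y * y
    if n_inside == 0:
        raise ValueError("no sample fell inside the region")
    z_av = z_sum / n_inside                 # mean of z over the points inside
    return z_av * sigma_area, z_av * n_inside / n_points


if __name__ == "__main__":
    with_exact_area, with_estimated_area = first_method(200_000)
    print(with_exact_area, with_estimated_area, 7 / 32)
```

The first returned value uses the exact area σ = 1/4, the second replaces σ by the observed fraction n/N, mirroring the two variants worked out in the text.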

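Second method. If the function $F(\xi) = F(\xi_1, \xi_2, \ldots, \xi_m)$ is
nonnegative, then the integral (5) may be regarded as the volume of
a solid V in the (m+1)-dimensional space $O\xi_1\xi_2\ldots\xi_m y$; that is,

$$I = \int\!\!\int \cdots \int_{V} d\xi_1\, d\xi_2 \cdots d\xi_m\, dy \tag{12}$$

where the domain of integration V is defined by the conditions

$$\xi = (\xi_1, \xi_2, \ldots, \xi_m) \in \sigma, \quad 0 \le y \le F(\xi)$$

Let

$$0 \le F(\xi) \le B \tag{13}$$

Introducing in (12) the new variable

$$\eta = \frac{y}{B} \tag{14}$$

we get

$$I = B \int\!\!\int \cdots \int_{v} d\xi_1\, d\xi_2 \cdots d\xi_m\, d\eta$$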




where the new domain v is a cylindroid in the space $O\xi_1\xi_2\ldots\xi_m\eta$
constructed on the region σ and bounded below by the hyperplane
η = 0 and from above by the hypersurface

$$\eta = \frac{1}{B} F(\xi)$$

(Fig. 89). By virtue of inequality (13), the volume v lies entirely
within the (m+1)-dimensional hypercube

$$0 \le \xi_i \le 1 \quad (i = 1, 2, \ldots, m), \qquad 0 \le \eta \le 1$$

Fig. 89

Now take, on [0, 1], m+1 uniformly distributed independent
random sequences

$$\{\xi_1^{(i)}\},\ \{\xi_2^{(i)}\},\ \ldots,\ \{\xi_m^{(i)}\},\ \{\eta^{(i)}\}$$

whose corresponding elements will be regarded as the coordinates
of the random points

$$M_i(\xi_1^{(i)}, \xi_2^{(i)}, \ldots, \xi_m^{(i)}, \eta^{(i)}) \quad (i = 1, 2, \ldots)$$

of space $O\xi_1\xi_2\ldots\xi_m\eta$. If n random points out of the total number
of N belong to the volume v, and N − n do not belong to this
volume, then for a sufficiently large number N we can approximately
put

$$I \approx B\,\frac{n}{N} \tag{15}$$

Thus,

$$I \approx B \cdot P(M \in v)$$

where the point M can occupy the positions $M_1, M_2, \ldots, M_N$
with the same probability. The validity of the relation

$$M \in v$$

is verified in the same way as in the first method. Note that if
σ is a unit hypercube $0 \le \xi_i \le 1$ (i = 1, 2, ..., m), then for the
point $M_i(\xi_1^{(i)}, \ldots, \xi_m^{(i)}, \eta^{(i)})$, all the coordinates of which are
assumed to belong to the unit interval [0, 1], it is sufficient to verify
only the validity of the relation

$$\eta^{(i)} \le \frac{1}{B} F(\xi_1^{(i)}, \xi_2^{(i)}, \ldots, \xi_m^{(i)})$$
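
The second method lends itself to a short program. The sketch below is an illustration under stated assumptions: σ is taken to be the whole unit hypercube, the integrand `F` and the bound `B` are supplied by the caller, Python's pseudorandom generator replaces the random sequences, and the name `hit_or_miss` is hypothetical.

```python
import random


def hit_or_miss(F, m, B, n_trials, seed=0):
    """Estimate the integral of F over the unit m-cube as B * n / N, formula (15).

    Assumes 0 <= F(xi) <= B everywhere on the cube.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        xi = [rng.random() for _ in range(m)]   # coordinates xi_1 .. xi_m
        eta = rng.random()                      # extra coordinate eta
        if eta <= F(xi) / B:                    # point M lies in the volume v
            hits += 1
    return B * hits / n_trials


if __name__ == "__main__":
    # Example integrand (an assumption for illustration): F = xi_1 * xi_2,
    # whose integral over the unit square is 1/4.
    print(hit_or_miss(lambda xi: xi[0] * xi[1], m=2, B=1.0, n_trials=200_000))
```

Choosing B larger than necessary keeps the estimate valid but lowers the hit fraction, so a tight bound is usually preferable.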




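Let us now consider the general case when

$$F(\xi) = F(\xi_1, \xi_2, \ldots, \xi_m)$$

is an alternating (sign-changing) function. Suppose

$$-b \le F(\xi) \le B \tag{16}$$

where b and B are nonnegative numbers. Put

$$F(\xi) = -b + (B + b)\,\tilde{F}(\xi)$$

and we have

$$\int \cdots \int_{\sigma} F(\xi)\, d\sigma = -b\sigma + (B + b) \int \cdots \int_{\sigma} \tilde{F}(\xi)\, d\sigma$$

where the function $\eta = \tilde{F}(\xi)$ satisfies, by inequality (16), the
inequalities

$$0 \le \tilde{F}(\xi) \le 1$$

The integral

$$\int \cdots \int_{\sigma} \tilde{F}(\xi)\, d\sigma = \int \cdots \int_{v} d\sigma\, d\eta$$

may be evaluated by the method indicated above.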
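
As an illustration of this reduction, here is a hedged Python sketch. It assumes that σ is the unit hypercube (so that its measure is 1), that the caller supplies bounds b and B satisfying (16), and that a pseudorandom generator is acceptable; `signed_integral` is an illustrative name.

```python
import random


def signed_integral(F, m, b, B, n_trials, seed=0):
    """Estimate the integral of a sign-changing F over the unit m-cube.

    Uses F = -b + (B + b) * F_tilde with 0 <= F_tilde <= 1 and estimates
    the integral of F_tilde by the hit fraction n / N.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        xi = [rng.random() for _ in range(m)]
        eta = rng.random()
        f_tilde = (F(xi) + b) / (B + b)     # rescaled integrand in [0, 1]
        if eta <= f_tilde:
            hits += 1
    sigma_measure = 1.0                     # measure of the unit hypercube
    return -b * sigma_measure + (B + b) * hits / n_trials


if __name__ == "__main__":
    # Example (assumed for illustration): F(x) = x - 1/4 on [0, 1],
    # with -1/4 <= F <= 3/4 and exact integral 1/4.
    print(signed_integral(lambda xi: xi[0] - 0.25, 1, 0.25, 0.75, 200_000))
```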
To estimate the error of the approximate equality*)

$$I = \int \cdots \int_{v} d\sigma\, d\eta = P(M \in v) \approx \frac{n}{N} \tag{17}$$

first suppose that we have to do with ideal random uniformly
distributed sequences of points $M_i$ (i = 1, 2, ...), whose coordinates
lie in the unit interval [0, 1].

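On the basis of the Bernoulli theorem and applying the Chebyshev
inequality, we have

$$P\left( \left| \frac{n}{N} - I \right| < \varepsilon \right) \ge 1 - \frac{1}{4N\varepsilon^2} \tag{18}$$

Given, for a specified ε, the guarantee probability

$$P\left( \left| \frac{n}{N} - I \right| < \varepsilon \right) \ge 1 - \delta \tag{19}$$

we get from inequality (18) that the condition (19) is definitely
met if

$$\frac{1}{4N\varepsilon^2} = \delta \tag{20}$$

From this we derive

$$\varepsilon = \frac{1}{2\sqrt{\delta N}} \tag{21}$$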

*) The factor B is inessential.




Thus, the accuracy of the estimate

$$I \approx \frac{n}{N}$$

for a given guarantee probability is inversely proportional to the
square root of the number of trials, or $\varepsilon = O(1/\sqrt{N})$. This
circumstance causes the relatively slow convergence of the Monte
Carlo method; for example, in order to reduce the error of the
result 10-fold, the number of trials must be increased 100-fold!
If the accuracy of the estimate ε and the guarantee probability
1 − δ are given, then from formula (20) we derive the necessary
number of trials

$$N = \frac{1}{4\varepsilon^2\delta} \tag{22}$$

For example, for ε = 0.001 and δ = 0.01 we have

$$N = 25{,}000{,}000$$

The estimate (22) is exaggerated and may be substantially improved.
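
Formulas (21) and (22) are easy to tabulate. The following short Python sketch simply evaluates them; the function names are illustrative and not taken from the text.

```python
import math


def required_trials(eps, delta):
    """Number of trials from formula (22): N = 1 / (4 * eps^2 * delta)."""
    return 1.0 / (4.0 * eps ** 2 * delta)


def guaranteed_accuracy(n_trials, delta):
    """Accuracy from formula (21): eps = 1 / (2 * sqrt(delta * N))."""
    return 1.0 / (2.0 * math.sqrt(delta * n_trials))


if __name__ == "__main__":
    print(round(required_trials(0.001, 0.01)))       # the example in the text
    # A 100-fold increase in N improves the accuracy 10-fold:
    print(guaranteed_accuracy(100, 0.01) / guaranteed_accuracy(10_000, 0.01))
```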

We note one important circumstance: the number of trials N
is independent of the dimensionality of the integral I, and therefore
the Monte Carlo method is used to advantage in computing
multiple integrals of high dimensionality, where the use of ordinary
cubature formulas encounters appreciable difficulties. For
example, in order to approximate in the ordinary way a 10-fold
integral extended over a unit volume with spacing h = 0.1, one
requires a sum containing roughly $10^{10}$ terms!

In practical Monte Carlo evaluation of multiple integrals, one
ordinarily uses s-digit uniformly distributed random sequences.
In this case, the fraction n/N will, if N is great, be close not to the
true volume I, but to a certain fictitious volume $I_s$, which
approximately represents the relative measure of the number of points
M with coordinates of the form

$$\xi_i = \frac{k_i}{10^s}, \quad \eta = \frac{k}{10^s} \tag{23}$$

$$(i = 1, 2, \ldots, m;\ k_i, k = 0, 1, 2, \ldots, 10^s)$$

that fall in volume v (cf. Sec. 17.3); strictly speaking, $I_s$ depends
on whether the boundary points belong to the volume v or not.
The total error of the result is estimated in the following manner
(see [2]):

$$\left| I - \frac{n}{N} \right| \le |I - I_s| + \left| I_s - \frac{n}{N} \right| \tag{24}$$







The first term $|I - I_s|$ of the right member of (24) is the usual
computational error obtained when replacing the integral I by a
sum corresponding to a partition of the volume v into elementary
cubic cells whose vertices belong to the grid (23). The magnitude
of this error may be evaluated by means of the inequality

$$|I - I_s| \le \bar{v} - \underline{v} \tag{25}$$

where $\bar{v}$ is the upper sum [in our case, for integral (17), it is
simply the volume of the circumscribed step-like solid] and $\underline{v}$ is
the lower sum (that is, the volume of the inscribed step-like
solid). The magnitude of the error $|I - I_s|$ depends essentially
on the number of digits s in the random numbers, and if the
boundary of the solid v is piecewise smooth, then this error can
be made arbitrarily small for sufficiently large s. The inconvenience
due to increasing the number of digits lies in the resulting increase
in the amount of labour, since the computations then involve
additional digits. The second term $|I_s - n/N|$ in the right-hand
member of inequality (24) is called the sampling error and may be
estimated probabilistically by means of the Bernoulli theorem, as
indicated above.
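
To make the notion of a fictitious volume concrete, the sketch below counts grid points exactly for one simple case. Everything specific in it is an assumption chosen for illustration: a single variable ξ, the integrand F(ξ) = ξ² with B = 1 (true volume 1/3), grid indices running from 0 to 10^s − 1 (s-digit numbers less than unity), and boundary points counted as belonging to v.

```python
def fictitious_volume(s):
    """Fraction of grid points (k1/q, k/q), q = 10**s, with k/q <= (k1/q)**2."""
    q = 10 ** s
    count = 0
    for k1 in range(q):
        # k/q <= (k1/q)**2  <=>  k <= k1*k1/q, so k runs over 0 .. k1*k1 // q
        count += k1 * k1 // q + 1
    return count / (q * q)


if __name__ == "__main__":
    for s in (1, 2, 3):
        i_s = fictitious_volume(s)
        print(s, i_s, abs(i_s - 1 / 3))   # grid error |I - I_s| for each s
```

However many trials are made, the hit fraction n/N computed from s-digit numbers approaches this grid value rather than 1/3, which is the first term of the bound (24).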





17.5 SOLVING SYSTEMS OF LINEAR ALGEBRAIC EQUATIONS
BY THE MONTE CARLO METHOD


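Let us consider the linear system

$$\sum_{j=1}^{n} a_{ij} x_j = b_i \quad (i = 1, \ldots, n) \tag{1}$$

In some way we reduce system (1) to the special form

$$x_i = \sum_{j=1}^{n} \alpha_{ij} x_j + \beta_i \quad (i = 1, \ldots, n) \tag{2}$$

Introducing the matrix $\alpha = [\alpha_{ij}]$ and the vectors

$$x = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1 \\ \vdots \\ \beta_n \end{bmatrix}$$

we can write system (2) in the matrix-vector form

$$x = \alpha x + \beta \tag{2'}$$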


We will assume that all eigenvalues of the matrix α are less than




unity in modulus. In particular, it suffices to consider that some
canonical norm of the matrix α obeys the inequality

$$\|\alpha\| < 1 \tag{3}$$

Then system (2') has a unique solution which can be found by
the method of iteration (Sec. 8.10).
We select a set of multipliers $v_{ij}$ so that the numbers $p_{ij}$ defined
by the equations

$$\alpha_{ij} = p_{ij} v_{ij} \quad (i, j = 1, \ldots, n) \tag{4}$$

satisfy the following conditions:
(1) $p_{ij} \ge 0$, and $p_{ij} > 0$ for $\alpha_{ij} \ne 0$,
(2) $\sum_{j=1}^{n} p_{ij} < 1$ (i = 1, ..., n).
Let

$$p_{i,\,n+1} = 1 - \sum_{j=1}^{n} p_{ij} \quad (i = 1, \ldots, n)$$

Besides, we agree to put

$$p_{n+1,\,j} = 0 \quad \text{for } j < n+1$$

and

$$p_{n+1,\,n+1} = 1$$

Let us consider a particle performing a random walk and possessing
a finite number of possible and incompatible states

$$S_1, S_2, \ldots, S_n, S_{n+1}$$

This particle is such that it passes from state $S_i$ to state $S_j$ with
probability $p_{ij}$ (i, j = 1, ..., n+1), irrespective of previous states
and with total indeterminacy relative to future states. The state
$S_{n+1} = \Gamma$ (a “boundary” or “absorbing barrier”) is special and
corresponds to a complete stop of the particle, since, by virtue of
the condition $p_{n+1,\,j} = 0$ (j = 1, ..., n), transitions from the state
$S_{n+1}$ to the states $S_j$ are impossible for j < n+1.
Thus, the process of a random walk ceases as soon as a particle
reaches the boundary Γ. The foregoing change of states is ordinarily
called a discrete Markov chain (more precisely, simple homogeneous)
with a finite number of states [2]. The numbers $p_{ij}$ are
called transition probabilities, and the matrix

$$\Pi = \begin{bmatrix} p_{11} & \cdots & p_{1n} & p_{1,\,n+1} \\ \vdots & & \vdots & \vdots \\ p_{n1} & \cdots & p_{nn} & p_{n,\,n+1} \\ 0 & \cdots & 0 & 1 \end{bmatrix}$$

is the transition probability matrix of the states $\{S_i\}$ (chain law).




Let $S_i$ be a fixed state different from the boundary state
(i < n+1). We consider the random walk of a particle which starts
out from the given state $S_{i_0} = S_i$ and, after a series of intermediate
states $S_{i_1}, S_{i_2}, \ldots, S_{i_m}$, terminates on the boundary $S_{i_{m+1}} = \Gamma$.
Thus, $S_{i_m}$ (m ≥ 0) is the state of the particle immediately preceding
its emergence at the boundary. The totality of states

$$T_i = \{S_{i_0}, S_{i_1}, \ldots, S_{i_m}, S_{i_{m+1}}\} \tag{5}$$

will for brevity be called a trajectory (path). Let $X_i$ be a random
quantity dependent on the random trajectories $T_i$ starting with
the state $S_i$ (the functional of the trajectory $T_i$) and assuming on
the trajectory (5) the value

$$\xi(T_i) = \beta_{i_0} + v_{i_0 i_1}\beta_{i_1} + v_{i_0 i_1} v_{i_1 i_2}\beta_{i_2} + \ldots + v_{i_0 i_1} v_{i_1 i_2} \cdots v_{i_{m-1} i_m}\beta_{i_m} \tag{6}$$

where $\beta_j$ ($j = i_0, i_1, \ldots, i_m$) are the corresponding constant terms
of the reduced system (2).
In particular, if $v_{ij} = 1$, we simply have

$$\xi(T_i) = \beta_{i_0} + \beta_{i_1} + \ldots + \beta_{i_m} \tag{6'}$$

By the product rule for probabilities, the trajectory $T_i$ and, hence,
the value $\xi(T_i)$, occurs with probability

$$P(T_i) = p_{i_0 i_1}\, p_{i_1 i_2} \cdots p_{i_m i_{m+1}} \tag{7}$$

where $i_0 = i$ and $i_{m+1} = n+1$.
Theorem. The mathematical expectations

$$M X_i = x_i \quad (i = 1, 2, \ldots, n)$$

are the roots of system (2).
Proof. The trajectories $T_i$ which start out from the state $S_i$ may
be partitioned into n+1 categories

$$T_{i1} = \{S_i, S_1, S_{i_2}, \ldots\},$$
$$T_{i2} = \{S_i, S_2, S_{i_2}, \ldots\},$$
$$\cdots$$
$$T_{in} = \{S_i, S_n, S_{i_2}, \ldots\},$$
$$T_{i,\,n+1} = \{S_i, S_{n+1}\}$$

depending on the first step; that is, a particle which initiates a
random walk from state $S_i$ can, at the first step, either pass into
state $S_1$ or into state $S_2$, and so on, and upon completion of a certain
number of steps terminate the random walk at the boundary.
If a particle has the trajectory

$$T_{ij} = \{S_i, S_j, S_{i_2}, \ldots, S_{i_m}, S_{i_{m+1}} = \Gamma\}$$

where j ≠ n+1, then the random quantity $X_i$ will, by (6), take
on the value

$$\xi(T_{ij}) = \beta_i + v_{ij}\beta_j + v_{ij} v_{j i_2}\beta_{i_2} + \ldots + v_{ij} v_{j i_2} \cdots v_{i_{m-1} i_m}\beta_{i_m}$$
$$= \beta_i + v_{ij}\,(\beta_j + v_{j i_2}\beta_{i_2} + \ldots + v_{j i_2} \cdots v_{i_{m-1} i_m}\beta_{i_m}) = \beta_i + v_{ij}\,\xi(T_j) \tag{8}$$

where $T_j$ is some trajectory with initial state $S_j$.
When the particle reaches the boundary Γ immediately, that
is, when the path is of the form $T_{i,\,n+1} = \{S_i, S_{n+1}\}$, then

$$\xi(T_{i,\,n+1}) = \beta_i \tag{8'}$$


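The probability that the trajectory $T_i$ is a trajectory of type $T_{ij}$
is obviously equal to $p_{ij}$.
By the definition of mathematical expectation, we have

$$M X_i = \sum_{T_i} \xi(T_i)\, P(T_i) = \sum_{j=1}^{n+1} \sum_{T_{ij}} \xi(T_{ij})\, P(T_{ij})$$

If j < n+1, then the trajectory $T_{ij}$ consists of the interval $(S_i, S_j)$
and some trajectory $T_j$. For this reason, $P(T_{ij}) = p_{ij} P(T_j)$. For
j = n+1 we have

$$\xi(T_{i,\,n+1}) = \beta_i \quad \text{and} \quad P(T_{i,\,n+1}) = p_{i,\,n+1}$$

Moreover, since each trajectory $T_{ij}$ is for j ≠ n+1 associated in
a one-to-one manner with the trajectory $T_j$ (and vice versa),
summation over the trajectories $T_{ij}$ for j = 1, 2, ..., n may be
replaced by summation over the trajectories $T_j$.

Whence, having regard for formula (8), we get

$$M X_i = \sum_{j=1}^{n} \sum_{T_j} [\beta_i + v_{ij}\,\xi(T_j)]\, p_{ij} P(T_j) + \beta_i\, p_{i,\,n+1}$$

or

$$M X_i = \sum_{j=1}^{n} p_{ij} v_{ij} \sum_{T_j} \xi(T_j)\, P(T_j) + \beta_i \left[ \sum_{j=1}^{n} p_{ij} \sum_{T_j} P(T_j) + p_{i,\,n+1} \right]$$

But, clearly,

$$\sum_{T_j} \xi(T_j)\, P(T_j) = M X_j \quad (j = 1, 2, \ldots, n)$$

Moreover, $\sum_{T_j} P(T_j) = 1$ and

$$\sum_{j=1}^{n} p_{ij} \sum_{T_j} P(T_j) + p_{i,\,n+1} = \sum_{j=1}^{n+1} p_{ij} = 1$$

Hence

$$M X_i = \sum_{j=1}^{n} \alpha_{ij}\, M X_j + \beta_i \quad (i = 1, \ldots, n)$$

where $\alpha_{ij} = p_{ij} v_{ij}$.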




The proof of the theorem is complete. 


Note. In proving this theorem we assumed that the mathematical
expectations

$$x_i = M X_i \quad (i = 1, \ldots, n)$$

exist. It can be proved that when condition (3) is met the random
quantities $X_i$ have finite expectations.

From the theorem just proved it follows that the roots of system
(2) may be regarded as the mathematical expectations of the
random quantities $X_1, \ldots, X_n$. For an experimental determination
of the quantity $x_i = M X_i$ one organizes N random walks with
random trajectories $T_i^{(k)}$ (k = 1, ..., N) with initial state $S_i$ and
each time one records the value $\xi(T_i^{(k)})$ of the random quantity $X_i$.

Suppose that the trials are independent and the quantity $X_i$ has
a bounded variance. Then, by virtue of Chebyshev’s theorem [1],
[2], for N sufficiently large, the following inequality will hold
true with a probability arbitrarily close to unity:

$$\left| x_i - \frac{1}{N} \sum_{k=1}^{N} \xi(T_i^{(k)}) \right| < \varepsilon$$

where ε is the given limiting error. Thus, the roots of the system
(2) can approximately be determined from the formulas

$$x_i \approx \frac{1}{N} \sum_{k=1}^{N} \xi(T_i^{(k)}) \tag{9}$$








In particular, this method can be used to invert a matrix of
the form

$$A = E - \alpha \tag{10}$$

where $\|\alpha\| < 1$ and $E = [\delta_{ij}]$ is the unit matrix. To do this, note
that the elements of the inverse matrix

$$A^{-1} = [x_{ij}]$$

are the roots of the linear system

$$\sum_{k=1}^{n} (\delta_{ik} - \alpha_{ik})\, x_{kj} = \delta_{ij} \quad (i, j = 1, \ldots, n)$$

whence we find that the elements of each column

$$x_{1j}, \ldots, x_{nj} \quad (j = 1, \ldots, n)$$

of the matrix $A^{-1}$ are determined from the linear subsystem

$$x_{ij} = \sum_{k=1}^{n} \alpha_{ik}\, x_{kj} + \delta_{ij} \quad (i = 1, \ldots, n) \tag{11}$$




On the basis of the foregoing, starting from the state $S_{i_0} = S_i$,
for fixed j, we obtain the random quantity $X_{ij}$ with values

$$\xi_j(T_i) = \delta_{i_0 j} + v_{i_0 i_1}\delta_{i_1 j} + \ldots + v_{i_0 i_1} \cdots v_{i_{m-1} i_m}\delta_{i_m j}$$

where $T_i = \{S_{i_0}, S_{i_1}, \ldots, S_{i_m}, S_{i_{m+1}} = \Gamma\}$ and the numbers $v_{ij}$ are
such that $p_{ij}$, determined from the equations $\alpha_{ij} = p_{ij} v_{ij}$, are
transition probabilities from the state $S_i$ to the state $S_j$. The
expectations $M X_{ij} = x_{ij}$ yield the desired elements of the matrix $A^{-1}$.
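
A hedged Python sketch of this inversion procedure follows. The 2×2 data are an assumption chosen for illustration (α with rows 0.1, 0.2 and 0.2, −0.3, split into probabilities `P` and multipliers `V`), states are indexed from 0, pseudorandom numbers replace a table, and the names `walk_value` and `estimate_inverse` are hypothetical.

```python
import random

# Assumed illustrative data: alpha_ij = P[i][j] * V[i][j].
P = [[0.1, 0.2], [0.2, 0.3]]        # transition probabilities between S_1, S_2
V = [[1.0, 1.0], [1.0, -1.0]]       # multipliers v_ij


def walk_value(i, j, rng):
    """One realization of X_ij: sum of weight * delta(state, j) along a walk."""
    n = len(P)
    state, weight, value = i, 1.0, 0.0
    while state < n:                       # index n stands for the boundary
        if state == j:
            value += weight                # delta term for the current state
        x = rng.random()
        cumulative, nxt = 0.0, n           # default: absorbed at the boundary
        for k in range(n):
            cumulative += P[state][k]
            if x < cumulative:
                nxt = k
                break
        if nxt < n:
            weight *= V[state][nxt]        # accumulate the product of v's
        state = nxt
    return value


def estimate_inverse(n_walks, seed=0):
    """Estimate every element of (E - alpha)^(-1) by averaging n_walks walks."""
    rng = random.Random(seed)
    n = len(P)
    return [[sum(walk_value(i, j, rng) for _ in range(n_walks)) / n_walks
             for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    for row in estimate_inverse(50_000):
        print(row)
```

For these data A = E − α has rows (0.9, −0.2) and (−0.2, 1.3), so the estimates can be compared with the exact inverse (1/1.13)·[[1.3, 0.2], [0.2, 0.9]].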

Let us now show, practically, how to organize a random walk
of a particle with given transition probabilities $p_{ij}$. For the sake
of simplicity, we assume that $p_{ij}$ are decimal fractions with common
denominator $10^s$ (s natural):

$$p_{i1} = \frac{t_{i1}}{10^s}, \quad p_{i2} = \frac{t_{i2}}{10^s}, \quad \ldots, \quad p_{i,\,n+1} = \frac{t_{i,\,n+1}}{10^s}$$

where $t_{i1}, t_{i2}, \ldots, t_{i,\,n+1}$ are nonnegative integers, and

$$t_{i1} + t_{i2} + \ldots + t_{i,\,n+1} = 10^s \quad (i = 1, 2, \ldots, n)$$

We consider a particle with initial state $S_i$. Let {x} be s-digit
numbers less than unity uniformly distributed on the interval
[0, 1]; for example, the elements of a table of random numbers.
Let us generate the random number x. If it happens that the
inequality

$$0 \le x < \frac{t_{i1}}{10^s}$$

holds true, then we will take it that the particle moves from
state $S_i$ to state $S_1$. Furthermore, if

$$\frac{t_{i1}}{10^s} \le x < \frac{t_{i1} + t_{i2}}{10^s}$$

then we assume that the particle moves from $S_i$ to $S_2$. The other
transitions are defined in a similar manner. In particular, a particle
hits the boundary $S_{n+1} = \Gamma$ if the random number x is such
that

$$\frac{t_{i1} + \ldots + t_{in}}{10^s} \le x < \frac{t_{i1} + \ldots + t_{in} + t_{i,\,n+1}}{10^s} = 1$$

On the basis of the given convention it is clear that the numbers
of favourable cases for the transitions $S_i \to S_j$ (j = 1, 2, ..., n+1)
are proportional to the respective numbers

$$t_{i1}, t_{i2}, \ldots, t_{i,\,n+1}$$

and these cases are equally probable. Therefore the transition
probabilities are

$$P(S_i \to S_j) = \frac{t_{ij}}{10^s} = p_{ij} \quad (i = 1, \ldots, n;\ j = 1, \ldots, n+1)$$
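
The selection rule just described can be written with integers only. The sketch below is an illustration: the s-digit random number x is represented by the integer `digits` = x·10^s, the row of counts `t_row` holds $t_{i1}, \ldots, t_{i,n+1}$, and the name `next_state` is hypothetical. The sample rows (1, 2, 7) and (2, 3, 5) with s = 1 are taken from the example of system (12) in this section.

```python
def next_state(t_row, digits):
    """Return the 1-based index of the next state (n+1 means the boundary).

    t_row  -- integers t_i1, ..., t_i,n+1 whose sum is 10**s
    digits -- integer in 0 .. 10**s - 1 representing x = digits / 10**s
    """
    threshold = 0
    for j, t in enumerate(t_row, start=1):
        threshold += t                 # running sum t_i1 + ... + t_ij
        if digits < threshold:         # x lies in the j-th subinterval
            return j
    raise ValueError("digits must be smaller than the sum of t_row")


if __name__ == "__main__":
    # From S_1: digit 0 keeps the particle in S_1, digits 1-2 lead to S_2,
    # digits 3-9 lead to the boundary (state 3).
    print([next_state([1, 2, 7], d) for d in range(10)])
    print([next_state([2, 3, 5], d) for d in range(10)])
```

Working with integer counts avoids any rounding question at the interval end points.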




Extracting a sequence of random numbers and taking the above 
rule for guidance, we get the random walk of a particle with fixed
initial state and given transition probabilities. To obtain the 
required accuracy of the roots (in a probabilistic sense) one must 
consider a sufficiently large number of independent random walks. 


Example. Solve the following system of equations by the Monte
Carlo method:

$$\begin{aligned} x_1 &= 0.1x_1 + 0.2x_2 + 0.7, \\ x_2 &= 0.2x_1 - 0.3x_2 + 1.1 \end{aligned} \tag{12}$$

Solution. We can put

$$v_{11} = 1, \quad v_{12} = 1, \quad v_{21} = 1, \quad v_{22} = -1$$

whence the transition probability matrix is

$$\Pi = \begin{bmatrix} 0.1 & 0.2 & 0.7 \\ 0.2 & 0.3 & 0.5 \\ 0 & 0 & 1 \end{bmatrix}$$

where the elements of the first row are, respectively, the probabilities
of transition from state $S_1$ to states $S_1$, $S_2$ and $S_3 = \Gamma$, and
the elements of the second row are those from state $S_2$ to states
$S_1$, $S_2$ and $S_3$, the “fringe” corresponding to the boundary Γ.

Since the elements of the matrix Π are multiples of 0.1, one
can use single-digit random numbers whose digits are recruited
from some random sequence, say, are elements of the random
numbers of Table 76 (Sec. 17.3).

The results obtained for 20 random walks with initial state $S_1$
are listed in Table 79. The random number x ensured the transitions
of the states in accord with the following instructions:

I. For initial state $S_1$:

(1) if 0 ≤ x < 0.1, then $S_1 \to S_1$,
(2) if 0.1 ≤ x < 0.3, then $S_1 \to S_2$,
(3) if 0.3 ≤ x < 1, then $S_1 \to \Gamma$.

II. For initial state $S_2$:

(1) if 0 ≤ x < 0.2, then $S_2 \to S_1$,
(2) if 0.2 ≤ x < 0.5, then $S_2 \to S_2$,
(3) if 0.5 ≤ x < 1, then $S_2 \to \Gamma$.

The values of the random quantity $X_1$ computed from formula (6)
are listed in the last column of Table 79. From this

$$x_1 = M X_1 \approx \frac{1}{20}\,(20 \cdot 0.7 + 0.7 + 4 \cdot 1.1) \approx 0.96$$




TABLE 79

FINDING THE UNKNOWN x_1 OF SYSTEM (12)
BY THE MONTE CARLO METHOD

No.  Random numbers             Random-walk trajectory               Value of X_1
 1   0.5                        S1 → Γ                               0.7
 2   0.7                        S1 → Γ                               0.7
 3   0.7                        S1 → Γ                               0.7
 4   0.0, …                     S1 → S1 → Γ                          0.7 + 0.7
 5   0.7                        S1 → Γ                               0.7
 6   0.1, …                     S1 → S2 → Γ                          0.7 + 1.1
 7   0.1, 0.8                   S1 → S2 → Γ                          0.7 + 1.1
 8   0.7                        S1 → Γ                               0.7
 9   0.3                        S1 → Γ                               0.7
10   0.7                        S1 → Γ                               0.7
11   0.1, 0.0, …                S1 → S2 → S1 → Γ                     0.7 + 1.1 + 0.7
12   0.0, 0.1, …, 0.1, …, 0.6   S1 → S1 → S2 → S2 → S1 → S2 → Γ      0.7 + 0.7 + 1.1 − 1.1 − 0.7 − 1.1
13   0.9                        S1 → Γ                               0.7
14   0.6                        S1 → Γ                               0.7
15   0.1, 0.5                   S1 → S2 → Γ                          0.7 + 1.1
16   0.3                        S1 → Γ                               0.7
17   0.3                        S1 → Γ                               0.7
18   0.2, 0.4, 0.4, …, …, 0.6   S1 → S2 → S2 → S2 → S2 → S1 → Γ      0.7 + 1.1 − 1.1 + 1.1 − 1.1 − 0.7
19   0.6                        S1 → Γ                               0.7
20   0.2, …                     S1 → S2 → Γ                          0.7 + 1.1
                                                                     Σ = 21·0.7 + 4·1.1







The unknown $x_2$ is computed in similar fashion.

Note that the exact roots of system (12) are $x_1 = 1$ and $x_2 = 1$.
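
The whole example can be rerun with many more walks. The following Python sketch is an illustration rather than the book's tabular procedure: it uses the multipliers and transition probabilities chosen in the solution above, draws continuous pseudorandom numbers instead of single digits from Table 76, indexes states from 0, and the names `walk_value` and `estimate_root` are hypothetical.

```python
import random

P = [[0.1, 0.2], [0.2, 0.3]]      # transition probabilities among S_1, S_2
V = [[1.0, 1.0], [1.0, -1.0]]     # multipliers v_ij, so alpha_ij = P * V
BETA = [0.7, 1.1]                 # constant terms of system (12)


def walk_value(start, rng):
    """Value of formula (6) on one random trajectory starting at `start`."""
    state, weight, value = start, 1.0, 0.0
    while state is not None:              # None stands for the boundary
        value += weight * BETA[state]     # add v...v * beta for this state
        x = rng.random()
        nxt, cumulative = None, 0.0
        for j in range(len(P)):
            cumulative += P[state][j]
            if x < cumulative:
                nxt = j
                break
        if nxt is not None:
            weight *= V[state][nxt]       # extend the product of multipliers
        state = nxt
    return value


def estimate_root(start, n_walks, seed=0):
    """Average of n_walks trajectory values, formula (9)."""
    rng = random.Random(seed)
    return sum(walk_value(start, rng) for _ in range(n_walks)) / n_walks


if __name__ == "__main__":
    print(estimate_root(0, 200_000), estimate_root(1, 200_000))
```

With a large number of walks both averages should settle near the exact roots, in line with the $O(1/\sqrt{N})$ behaviour of the sampling error.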

Other methods are also employed for solving algebraic linear
equations by the Monte Carlo method [11].


REFERENCES FOR CHAPTER 17 


[1] E. S. Ventsel, The Theory of Probability, 1958, Chapters I-VI (in Russian).

[2] B. V. Gnedenko, The Theory of Probability, 1969, Chapters 1-6 (translated
from the Russian).

[3] A. S. Householder, Principles of Numerical Analysis, 1953, Chapter 8.

[4] W. E. Milne, Numerical Solution of Differential Equations, 1953, Appendix C.

[5] Yu. A. Shreider, A Method of Statistical Trials (Monte Carlo), 1956 (in
Russian).

[6] Edwin F. Beckenbach (editor), Modern Mathematics for the Engineer, First
Series, 1956; Chapter 12, Monte Carlo Methods by George W. Brown.

[7] P. M. Morse and G. E. Kimball, Methods of Operations Research, 1951,
Chapter 6, Sec. 4.

[8] M. Kadyrov, Tables of Random Numbers, 1936 (in Russian).

[9] A. I. Kitov and N. A. Krinitsky, Electronic Digital Computers and
Programming, 1959, Chapter VIII (in Russian).

[10] P. Davis, P. Rabinowitz, Some Monte Carlo Experiments in Computing
Multiple Integrals, 1956.

[11] Yu. A. Shreider, Solution of Systems of Linear Algebraic Equations by the
Monte Carlo Method, 1958 (in Russian).


COMPLETE LIST OF REFERENCES 


Beckenbach, Edwin F. (editor), Modern Mathematics for the Engineer, First
Series, McGraw-Hill Book Company, New York, 1956 (for Chapters 8, 13, 17).

Berezin, I. S. and Zhidkov, N. P., Computational Methods, Fizmatgiz, 1959,
in Russian (for Chapters 8, 12, 16).

Bezikovich, Ya. S., Calculus of Finite Differences, LGU, 1939, in Russian (for
Chapter 6).

Bezikovich, Ya. S., Approximate Computations, Gostekhizdat, 1949, in Russian 
(for Chapters 1,4). 

Bradis, V. M., The Theory and Practice of Computations, Moscow, Uchpedgiz, 
1935, in Russian (for Chapter 14). 

Bradis, V. M., Oral and Written Computing. Computational Aids, Encyclopedia 
of Elementary Mathematics, Book 1, Gostekhizdat, 1951, in Russian (for 
Chapter 1). 

Bulgakov, B. V., Oscillations, Gostekhizdat, 1954, in Russian (for Chapter 7).

But, E., Numerical Methods, Moscow, Fizmatgiz, 1959, in Russian (for Chap- 
ter 13). 

Davis, P., Rabinowitz, P., Some Monte Carlo Experiments in Computing Mul- 
tiple Integrals, Math. Tables and Other Aids Comput., 1956, 10, No. 53, 
1-8 (for Chapter 17).

Faddeyev, D. K. and Faddeyeva, V. N., Computational Methods of Linear 
Algebra, Fizmatgiz, 1960, in Russian (for Chapters 7, 8, 12). 

Faddeyeva, V. N., Computational Methods of Linear Algebra, Gostekhizdat, 1950,
in Russian (for Chapters 7, 8, 9, 10, 11, 12, 14).

Fikhtengolts, G. M., Mathematics for Engineers, GTTI, 1933, in Russian (for
Chapter 1).

Fikhtengolts, G. M., Principles of Mathematical Analysis, Vol. 1, Gostekhizdat,
1955, in Russian (for Chapter 2).

Fikhtengolts, G. M., Course of Differential and Integral Calculus, Nauka, 1969,
in Russian (for Chapters 3, 16).

Fikhtengolts, G. M., Course of Differential and Integral Calculus, Gostekhizdat,
1957, in Russian (for Chapter 4).

Fikhtengolts, G. M., Principles of Mathematical Analysis, Vol. II, Gostekhizdat,
1956, in Russian (for Chapter 6).

Frazer, R. A., Duncan, W. J., Collar, A. R., Elementary Matrices and Some 
Applications to Dynamics and Differential Equations, New York, The Macmillan 
Company, 1946 (for Chapter 7). 

Fuks, B. A. and Shabat, B. V., Functions of a Complex Variable, Moscow-Le- 
ningrad, Gostekhizdat, 1949, in Russian (for Chapter 5). 

Gantmacher, F. R., The Theory of Matrices, Moscow, Gostekhizdat, 1953, in
Russian; or translated from the Russian, Chelsea, New York, 1959 (for
Chapter 10).




Gavurin, M. K., Application of Polynomials of Best Approximation to the
Acceleration of Convergence of Iterative Processes, Uspekhi Matem. Nauk, 5:3
(37), 1950, in Russian (for Chapter 12).

Gelfand, I. M., Lectures on Linear Algebra, Moscow-Leningrad, Gostekhizdat,
1951, in Russian (for Chapters 10, 12).

Gelfond, A. O., Calculus of Finite Differences, Gostekhizdat, 1952, in Russian
(for Chapters 4, 5, 6).

Gnedenko, B. V., The Theory of Probability, Moscow, MIR Publishers, 1969,
translated from the Russian (for Chapter 17).

Goncharov, V. L., The Theory of Interpolation and Approximation of Functions,
Moscow-Leningrad, GTTI, 1934, in Russian (for Chapter 14).

Grave, D., Elements of Higher Algebra, Kiev, 1914, in Russian (for
Chapter 5).

Hildebrand, F. B., Introduction to Numerical Analysis, New York, McGraw-Hill
Book Company, 1956 (for Chapter 5).

Householder, Alston S., Principles of Numerical Analysis, New York,
McGraw-Hill Book Company, 1953 (for Chapters 10, 13,

Kadyrov, M., Tables of Random Numbers, Tashkent, Publishing House of State
University of Central Asia, 1936, in Russian (for Chapter 17).

Kagan, B. M. and Ter-Mikaelyan, T. M., Solution of Engineering Problems
on Automatic Digital Computers, Moscow-Leningrad, Gosenergoizdat, 1958,
in Russian (for Chapter 3).

Kantorovich, L. V., On Newton’s Method, Transactions of the Steklov Institute
of Mathematics, XXVIII, 1949, in Russian (for Chapters 4, 13).

Kantorovich, L. V., Krylov, V. I., Approximate Methods of Higher Analysis,
Gostekhizdat, 3rd ed., 1949, in Russian (for Chapter 6).

Khinchin, A. Ya., Continued Fractions, Gostekhizdat, 1949, in Russian (for 
Chapter 2). 

Khovansky, A. N., Application of Continued Fractions and Their Generalizations
to Problems of Approximate Analysis, Gostekhizdat, 1956, in Russian (for
Chapters 2, 3).

Kitov, A. I. and Krinitsky, N. A., Electronic Digital Computers and Program- 
ming, Moscow, Fizmatgiz, 1959, in Russian (for Chapter 17). 

Krylov, A. N., Lectures on Approximate Computations, Leningrad, Academy of 
Sciences, USSR, 2nd ed., 1933, in Russian (for Chapters 1, 5, 16). 

Krylov, A. N., Lectures on Approximate Computations, Gostekhizdat, 6th ed., 
1954, in Russian (for Chapters 6, 15). 

Krylov, V. I., Approximate Calculation of Integrals, Moscow, Fizmatgiz, 1959,
in Russian (for Chapter 16). 

Kurosh, A. G., Course of Higher Algebra, Moscow, MIR Publishers, 1972,
translated from the Russian (for Chapters 5, 12).

Lednev, N. A., (editor), Mathematical ate Session Devoted to Computing 
Instruments and Machines, Soviet Science, 1959, Moscow, in Russian (for 
Chapter 14). 

Lyapin, E. S., Course of Higher Algebra, Uchpedgiz, 1953, in Russian (for
Chapter 7). 

Lyusternik, L. A., Transactions of the Steklov Institute of Mathematics, 20, 1947,
page 49, in Russian (for Chapter 12). 

Lyusternik, L. A., Abramov, A. A., Shestakov, V. I., Shura-Bura, M. R., The
Solution of Mathematical Problems on Automatic Digital Computers, USSR
Academy of Sciences Publishing House, 1952, in Russian (for 
Chapter 3). 

Maltsev, A. I., Principles of Linear Algebra, Moscow-Leningrad, Gostekhizdat, 
1948, in Russian (for Chapter 10). 

Maltsev, A. I., Principles of Linear Algebra, Gostekhizdat, 2nd ed., 1956, in
Russian (for Chapter 7).

Markov, A., Calculus of Finite Differences, Matezis, 2nd ed., 1911, in Russian
(for Chapters 3, 6, 16).




Mikeladze, Sh. E., Numerical Methods of Mathematical Analysis, Moscow, Gos- 
tekhizdat, 1953, in Russian (for Chapters 15, 16). 

Milne, W. E., Numerical Calculus, Princeton, New Jersey, Princeton University 
Press, 1949 (for Chapters 12, 14, 15, 16). 

Milne, W. E., Numerical Solution of Differential Equations, New York, London, 
1953 (for Chapters 13, 17). 

Mlodzeyevsky, B. K., Solution of Numerical Equations, Moscow, GIZ, 1924, in 
Russian (for Chapter 5). 

Morse, P. M. and Kimball, G. E., Methods of Operations Research, New York,
1951, Chapter 6, Sec. 4 (for Chapter 17).

Nikolsky, S. M., Quadrature Formulas, Moscow, Fizmatgiz, 1958, in Russian 
(for Chapter 16). 

Ostrowski, Alexander M., Sur la convergence et l'estimation des erreurs dans quel- 
ques procédés de résolution des équations numériques (Collection of Papers in 
Memory of D. A. Grave, Moscow GITTL, 1940, p. 213) [Sbornik posvyash- 
chonny pamyati D. A. Grave] (for Chapter 13).

Perron, O., Die Lehre von den Kettenbrüchen, Leipzig, Teubner, 1929 (for
Chapter 2).

Remez, E. Ya., General Computational Methods of Chebyshev Approximation, 
Ukrainian Academy of Sciences Publishing House, 1957, in Russian (for 
Chapter 14). 

Runge, Carl, Graphische Methoden, Teubner, Leipzig-Berlin, Zweite Auflage,
1919 (for Chapter 15).

Salekhov, G., Calculation of Series, Gostekhizdat, 1955, in Russian (for Chapter 6). 

Salvadori, Mario G., and Baron, M. L., Numerical Methods in Engineering, 
New York, Prentice-Hall, Inc., 1952 (for Chapters 8, 16). 

Scarborough, James B., Numerical Mathematical Analysis, Baltimore, Johns Hop- 
kins Press, 3rd ed., 1955, (for Chapters 1, 4, 5, 8, 13, 14, 15, 16). 

Schreier, O. und Sperner, E., Vorlesungen über Matrizen, Leipzig, Teubner, 
1932 (for Chapter 7). 

Shapiro, G. M., Higher Algebra, Moscow, GUPI, 4th ed., 1938, in Russian (for
Chapter 5).

Shilov, G. E., Introduction to the Theory of Linear Spaces, Moscow-Leningrad,
Gostekhizdat, 1952, in Russian (for Chapter 10).

Shreider, Yu. A., The Solution of Systems of Linear Algebraic Equations, Do- 
klady Akademii Nauk SSSR, 5, 1951, in Russian (for Chapter 10). 

Shreider, Yu. A., Solution of Systems of Linear Algebraic Equations by the 
Monte Carlo Method, Sbornik 1: “Problems in the Theory of Mathematical 
Machines”, Moscow, Fizmatgiz, 1958, in Russian (for Chapter 17). 

Shreider, Yu. A., A Method of Statistical Trials (Monte Carlo), Instrument Design
Journal, No. 7, 1956, in Russian (for Chapter 17).

Smirnov, V. I., Course of Higher Mathematics, Vol. 3, Moscow-Leningrad,
GTTI, 1933, in Russian (for Chapter 11).

Smirnov, V. I., Course of Higher Mathematics, Vol. 1, Moscow, Gostekhizdat, 
17th ed., 1957, in Russian (for Chapter 3). 

Smolitsky, Kh. L., Computational Mathematics (Lecture Notes), Leningrad, 
Mozhaisky LKVVIA, 1960, in Russian (for Chapter 8). 

Steffensen, J. F., Interpolation, Baltimore, Williams and Wilkins, 1927 (for
Chapter 16).

Tolstov, G. P., Fourier Series, Gostekhizdat, 1951, in Russian (for Chapter 6).

Tolstov, G. P., Course of Mathematical Analysis, Vol. 1, Gostekhizdat, 1954, in 
Russian (for Chapter 4). 

Tolstov, G. P., Course of Mathematical Analysis, Vol. II, Moscow, Gostekhizdat,
1957, in Russian (for Chapter 3).

Vallée Poussin, C. J., de la, Cours d’Analyse Infinitesimale, I (4th edition),
1921 (for Chapter 6).

Ventsel, E. S., The Theory of Probability, Moscow, Fizmatgiz, 1958, in Russian
(for Chapter 17).




Ventsel, D. A., Ventsel, E. S., Elements of the Theory of Approximate
Computations, Moscow, Publishing House of VVIA named after N. E. Zhukovsky,
1949, in Russian (for Chapters 1, 4, 13).

Wayland, Harold, Expansion of Determinantal Equations into Polynomial Form,
Quarterly of Applied Mathematics, Vol. II, No. 4, Jan. 1945, pp. 277-306
(for Chapter 12).

Whittaker, E. T., and Robinson, G., The Calculus of Observations, London and 
Glasgow, 4th edition, 1944 (for Chapters 4, 7, 14). 

Zaguskin, V. L., Handbook on Numerical Methods of Solving Algebraic and 
Transcendental Equations, Fizmatgiz, 1960, in Russian (for Chapter 5). 


Abel, N. H.

Euler-Abel method 209 

Euler-Abel transformation 210 
Abramov, A. A. 114, 676 
absolute error 19 
absolute value of a matrix 242 
absorbing barrier 667 
acceleration of convergence 

of Fourier trigonometric series by

Krylov’s method 217 
of numerical series 203 


of power series by Euler-Abel 
method 209 
of series 89 
accumulation, method of 306, 640 
accuracy 


in determination of arguments from 
a tabulated function 48 
estimation of 17 
of quadrature formulas 618 
additive inverse of a matrix 232 
adjoint of a matrix 236 
algebra 
fundamental theorem of 162 
matrix 229 
algebraic equations 
approximate solution of (special
techniques for) 162ff
bounds of real roots of 167
general properties of 162f
algebraic volume 656


algorithm 
Euclid’s 165 
Hero’s 107 


alternating sums, method of 169 
analysis, classical 5 


analytic functions, computing values of
89


angle between two vectors 345 
approximate differentiation 574ff 
formulas of, based on Newton’s 
first interpolation formula 575 
formulas of, based on Stirling’s for- 
mula 580 
approximate integration 590f 
approximate numbers 19ff 
approximate solution of algebraic equations,
special techniques for 162ff
approximation 
of improper integrals 633 
major 19 
minor 19 
trigonometric 225
argument, principle of 166 


back substitution (see direct procedure) 
280 


backward extrapolation 529 
backward interpolation 529 
Barlow’s tables 113 
Baron, M. L. 321, 648, 677
barrier, absorbing 667
bases (see basis)
basis (bases)
biorthogonal 391
normalized orthogonal 346 
orthogonal 346 
orthonormal 346 


INDEX 


of a space 340 
initial 341 

Beckenbach, E. F. 321, 506, 674, 675 
Berezin, I. S. 321, 458, 645, 675 
Bernoulli method 198f 
Bernoulli numbers 99, 208, 625ff 

generating function of 627 
Bernstein, S. N. 608
Bessel’s formula for parabolic interpolation 535

Bessel’s function of order zero 582 
Bessel’s inequality 216 
Bessel’s interpolation formula 534, 535 
Bezikovich, Ya. S. 54, 161, 228, 675 
bilinear expansion of a matrix 392 
bilinear form of a matrix 384 
biorthogonal bases 391
biorthogonality, conditions of 390 
biorthogonality relations 390, 393 
blocks of a matrix 256 
bordered matrices 257 
bordering, method of (matrices) 262 
bounds 

method of 50

of real roots of algebraic equations 167 
Bradis, V. M. 54, 573, 675 
Brown, G. W. 674 
Budan 

theorem of Budan-Fourier 175, 176
Bulgakov, B. V. 272 
But, E. 506, 675 


canonical convergents 59 

canonical norm of a matrix 243 
Cardan’s formula 186 

Cauchy inequality 245 

Cauchy test 83 

Cayley-Hamilton theorem 397 

central derivatives, formulas for 580 
central differences 530

central formulas 580

central-difference formulas 531
chain, discrete Markov 667
chain law 668 
change-of-basis matrix 348 
characteristic determinant 376, 565
characteristic equation 376
characteristic matrix 376
characteristic number 375, 377
characteristic polynomial 376, 382
characteristic root 375
Chebyshev, P. L. 554


Chebyshev polynomial 554
Chebyshev’s quadrature formula 607ff 
check 

final 17 

intermediate 17
check sums 281 
chords, method of 122 
class, periodicity 214 
classical analysis 5 
coefficients 

Cotes 594, 600 

Fourier 213 

Lagrangian 543 
Collar, A. R. 272, 675 
column vector 229 
combination method 136 
commutative matrices 234 




complex roots 
one pair of 190
two pairs of 194, 197 
components of continued fractions, terms


computation(s) (see also computing) 
of determinants 269 
double, method of 50 
in which errors are not taken into 
exact account 42 
computation sheets 16 
computational scheme of Danilevsky’s 
method 416 
computation work, rules of 15ff 
computing (see also computation) 
analytic functions 89 
cube roots 112
exponential functions 91 
forms 16 
functions 77ff 
hyperbolic functions 101 
logarithmic functions 95 
polynomials 77 
rational fractions 82
reciprocals 104 
reciprocals of square roots 111 
square roots 107 
trigonometric functions 98 


conditions 
of biorthogonality 390 
Hurwitz 406 
Sylvester 389 
conformal partitioned matrices 257 
continued fraction(s) 55 
components of 55 
conversion of to a simple fraction 56 
expanding functions into 72f 
infinite 55 
nth component 55
nonterminating 66
convergent 66 
divergent 67 
simple 56 
standard 56 
theory of 55ff


contraction mapping 487ff 
convergence 
accelerating (by Lyusternik method)
453, 458
of Fourier trigonometric series (acceleration of by Krylov method) 217
of iteration processes 
first sufficient condition of 491 
second sufficient condition of 493 
of iteration processes for systems of
linear equations 322ff, 394ff
necessary and sufficient conditions
for
sufficient conditions for 322 


of matrix power series 394 
methods for effectively checking the 
conditions of 405 
of the Newton process 465, 469 
rapidity of 474 
stability of 478 
of numerical series, accelerating 203 
of power series, acceleration of by 
Euler-Abel method 209 
of series, acceleration of 89 
of Seidel process 
first sufficient condition for 327
necessary and sufficient conditions
for 400
for a normal system 403




second sufficient condition for 330 

third sufficient condition for 333 
convergence theorem 214 
convergent integral 633, 635 
convergents 58f 

canonical 59

law of formation of 59 
coordinates of a vector 336 

in a basis 341 
correct digit 25 
Cotes 

Newton-Cotes quadrature formulas 
593 


Newton-Cotes formulas 599 
Cotes coefficients 594, 600 
Cramer’s formulas 276 
Cramer’s rule 273 
cubature, mechanical 590 
cubature formulas 590, 641ff

of Simpson type 644 

Simpson's 644 
cube roots, computation of 112 


Danilevsky, A. M. 
method of 412 
computation of eigenvectors by 420
computational scheme of 416 
exceptional cases in 418 
Davis, P. 674, 675 
degenerate linear transformation 374 
Del 497 
delta 
Kronecker 230 
the operator Δ 508
derivatives 
central 580 
partial 588 
Descartes’ rule of signs 178 
descent, steepest (see method of steepest
descent) 496, 499
determinant(s) 230 
characteristic 376, 565 
computation of 269 
Gaussian method in computing 288 
secular 376, 565 
expansion of 410ff, 429 
Vandermonde 592, 614 
diagonal difference table 511 
diagonal matrix 229, 
difference(s) 
central 530 
divided 554ff
double (of higher order) 570 
error of 34 
finite 507ff
lambda (λ-difference) 440
difference equation 198 
difference table 510 
diagonal 511 
horizontal 511
differentiation 
approximate 574ff 
graphical 586 
numerical 583 
differentiation operator 629 
digit 
correct 25 
significant 24 
dimensionality of a subspace 342 
direct procedure (see forward substitution) 280
discrete Markov chain 667
distribution function 651 
distributivity 344 







divergent integral 633, 635 
divided differences 554ff

table of 556 
double computation, method of 50, 622 
double differences of higher order 570
double-entry table 569 
Duncan, W. J. 272, 675


eigenvalue(s) 375, 377, 382, 445 
extremal property of 387 
finding (of a matrix) 410ff
finding the first 436 
finding the numerically largest 430 
finding the second 439 
eigenvector(s) 375, 382, 445ff 
computation of by Danilevsky’s method 420
computation of by Krylov’s method 
424 


finding (see eigenvalue) 410ff, 430, 439
element(s) 
of a matrix 229 
principal 287 
Emde 524, 568 
empirical formula 524 
entries (of a matrix) 229 
equal effects, principle of 45
equation(s) 
algebraic 162 
characteristic 376 
difference 198
equivalent 119 
graphical solution of 119 
linear 273 
nonlinear (see systems of nonlinear
equations) 459 
root of 115ff 
secular 376 
solution of 199 
equivalent equations 119 
equivalent matrices 268
error(s) 19 
absolute 19 
computations in which errors are not 
taken into exact account 42 
of a difference 35 
epsilon (ε-error) (propagation law
of) 515 


general formula for 42 
initial 23 
limiting absolute 20 
limiting relative 21 
of method 22, 621 
of operation 23, 84 
of the problem 22 
of a product 37 
of a quotient 40 
relative (see relative error) 19, 21 
residual 23
rounding 23 
sampling 666 
sources of 22ff 
of a sum 33 
theory of 42 
error estimate, probability 52 
escalator method 320 
Euclid’s algorithm 165 
Euler, L. 58 
Euler-Abel method 209 
Euler-Abel transformation 210 
Euler-Maclaurin formula 628, 630 
Euler-Maclaurin summation formula 
630 
even-digit rule 26 




exact methods 273 
exact number 19 
exhaustion, method of 443, 444
expansion 
bilinear (of a matrix) 392 
of secular determinants 410ff 
comparison of different methods of 429

Stirling's 207 
exponential functions 50, 91 

computing values of 91 
extrapolation 519 

backward 529 

forward 529 

Richardson 622 
extremal property of eigenvalues 387


factorial(s) (see generalized power) 517 
inverse 205 
Faddeyev, D. K. 272, 321, 458, 657
Faddeyeva, V. N. 272, 321, 335, 393,
409, 458, 573, 675
Fikhtengolts, G. M. 54, 76, 114, 161,
228, 648, 675
final check 17 
finite differences 507ff
finite-difference operator 628
form(s) 
bilinear (of a matrix) 384 
computing 16 
Frobenius standard 412 
quadratic (see quadratic form) 311 
formula(s) 
of approximate differentiation
based on Newton's first interpola- 
tion formula 575 
based on Stirling’s formula 580 
Bessel’s interpolation 534, 535 
Bessel’s, for parabolic interpolation 535


Cardan’s 186 
central 580 
for central derivatives 580 
central interpolation 552 
central-difference 531
Chebyshev’s quadrature 607ff 
Cramer’s 276 
cubature 590, 641ff 
of Simpson type 644 
Simpson’s 64 
empirical 524 
Euler-Maclaurin 628, 630 
Euler-Maclaurin summation 630 
Gauss’ quadrature 611, 614 
Gaussian interpolation 531, 532, 533 
general trapezoidal 601
for interpolating to halves 536
interpolation (see interpolation formulas)
Lagrange’s interpolation 539, 541 
Lambert’s 100 
Markov’s 566 
Newton-Cotes (of higher orders) 599
Newton-Cotes quadrature 593 
Newton-Leibniz 590
Newton's first interpolation 519, 522 
Newton’s second interpolation formula 526, 527
Newton’s quadrature 599 
quadrature 590 
accuracy of 618 




of closed type 591 
Newton's 599 
of open type 591 
Simpson’s (and its remainder term)
596


Simpson’s general 603 
Stirling’s interpolation 533 
trapezoidal (and its remainder term)
595 


Forsythe, G. E. 321
forward extrapolation 529 
forward interpolation 529 
forward substitution (see direct procedure) 280
Fourier 
theorem of Budan-Fourier 175, 176 
Fourier coefficients 213 
estimates of 213 
Fourier trigonometric series 213
fraction(s)
continued (see continued fractions) 55 
rational (see rational fractions) 82 
Frazer, R. A. 272, 675
Frobenius matrix 412, 415, 418 
Frobenius standard form 412 
Fuks, B. A. 202, 675 
function(s) 
analytic 89 
Bessel’s (of order zero) 582 
computing values of 77ff
distribution 651 
exponential 50, 91 
hyperbolic 101 
interpolating 518 
interpolation of 507ff, 519
iteration for approximating the values
of 103ff


jump 221

logarithmic 95 

signum 86 

trigonometric 49, 98 

zero of 115 
fundamental system of solutions 365
fundamental theorem of algebra 162 


Gantmacher, F. R. 393, 675 

Gauss’ first interpolation formula 532 
Gauss’ quadrature formula 611, 614 
Gauss’ second interpolation formula 532, 


Gaussian interpolation formulas 531, 


Gaussian method 277ff

inversion of matrices by 290 

use of in computing determinants 288 
Gaussian random sequence 652 
Gavurin, M. K. 458, 676 

method of 458 
Gelfand, I. M. 393, 458, 676 
Gelfond, A. O. 161, 202, 228, 676 
general formula for errors 42 
general trapezoidal formula 601 
generalized power 517 
generate (see space, generated by vectors) 342


Gnedenko, B. V. 674, 676 
Goncharov, V. L. 573, 676 
gradient (of a function) 497 
gradient method 496 
Graeffe 182
method of Lobachevsky-Graeffe
179, 182
graphical differentiation 586 




graphical integration 639 
graphical solution of equations 119 
Grave, D. 202, 676 


halving method 121 

Hermitian symmetry 343 

Hero’s algorithm 107 

Hero’s process 107

Hildebrand, F. B. 201, 202, 676 

horizontal difference table 511 

Horner’s scheme 77, 78 
generalized 80-82 

Householder, A. S. 393, 506, 674, 676

Hua’s theorem 179 

Hurwitz conditions 405 

Hurwitz theorem 406 

hyperbolic cosine 101

hyperbolic functions 101

hyperbolic sine 101 

hyperbolic tangent 102 

hypercube, unit m-dimensional 656 


identical transformation 375 
improper integral(s) 633 
approximation of 633
improving roots 284 
inequality 
Bessel 216 
Cauchy 245 
infinite continued fraction 55 
initial basis of a space 341 
initial error 23 
integral(s) 
convergent 633, 635 
divergent 633, 635 
improper 633 
multiple 656 
probability 561 
proper 633 
integration
approximate 590ff 
graphical 639 
intermediate check 17 
interpolating function 518 
interpolating to halves, formula for 536 
interpolation 519 
backward 529 
forward 529 
of functions 507, 519 
of functions of two variables 567 
inverse (see inverse interpolation) 
method of 411 
in the narrow sense 519 
parabolic 522 
problem of (statement of) 518 
interpolation formula(s) 
Bessel’s 534, 535 
remainder term of 552 
central 552
with constant interval (general) 536f 
Gauss’ first 532 
Gauss’ second 533 
Gaussian 531 
Lagrange’s 539, 541 
error estimate of 547 
Newton’s 519 
error estimates of 550
for a function of two variables 571
for unequally spaced values of the 
argument 556, 558 
Newton's first 519, 522 







Newton’s second 526, 527 
Stirling’s 533 
remainder term of 552 
interpolation method, for expanding
a secular determinant 
interpolation points 518 
best choice of 553 
interval (see spacing) 507 
variable 555 
invariance of a linear subspace 379 
inverse, additive (of a matrix) 232 
inverse factorials 205 
inverse interpolation 
for case of equally spaced points 559 
for case of unequally spaced points 562 
finding roots of an equation by 564 
inverse matrix 236
correcting elements of an approximate 
316 
properties of 239 
inverse problem of theory of errors 44 
second 48 
inverse transformation 373 
inversion of matrices 236 
by Gaussian method 290 
solution of systems of linear equations
by 273 
isolation 
of roots 115 
of singularities by Kantorovich method 635
iteration (see linear system) 
for approximating the values of a 
function 103ff
method of 138ff, 300, 302, 484ff
process of, convergence of (see con- 
vergence of process of iteration) 
491, 493 
of a vector 431 
iteration processes
convergence of (for systems of linear
equations) 322ff, 394ff
estimate of the error of approximations
in 324
iterative methods (see methods, itera- 
tive, and methods of iteration) 


Jacobi matrix 465, 470 
Jacobian 156, 657 
Jahnke 524, 568 

jump function 221


Kadyrov, M. 674 

Kagan, B. M. 114, 676

Kantorovich, L. V. 161, 228, 465, 481,
506, 676
method of (for isolating
singularities) 635

Kantorovich theorem 465 

Khaletsky, scheme of 295, 297 

Khinchin, A. Ya. 76, 676 

Khovansky, A. N. 76, 114, 676

Kimball, G. E. 674, 677 

Kitov, A. I. 674, 676 

Krinitsky, N. A. 674, 676 

Kronecker delta 230 


Krylov, A. N. 54, 189, 202, 217, 228, 
589, 648, 676 


method of 217, 225, 411, 421ff
computation of eigenvectors by 424 
Krylov, V. I. 225, 648, 676
Kummer transformation 203, 204 
Kurosh, A. G. 202, 409, 458, 676




Lagrange, method of 169
Lagrange’s interpolation formula 539, 
541 


Lagrange’s theorem 168 
Lagrangian coefficients 543 
computing 543 
computational scheme for 546, 547 
lambda-difference 440 
Lambert’s formula 100 
latent root 375 
law 
chain 668 
propagation (of ε-error) 515
Lednev, N. A. 573, 676
Legendre polynomials 611 
properties of 611 
Leibniz
Newton-Leibniz formula 590
length of a vector 345 
Leverrier, method of 411 
limiting absolute error 20ff
limiting relative error 21 
tables for determining 30 
Lin’s method 201 
linear dependence of vectors 337 
linear equations, solving systems of
273ff


linear subspace 341 
invariance of 379 
linear system 
normal 312 
reducing (to a form convenient for 
iteration) 307 
linear transformations 
degenerate 374 
operations with 371 
singular 374 
of variables 367 
linear vector spaces 336ff
theory of 336ff 
linear-transformation operator 369 
linearly dependent vectors 337 
linearly independent vectors 337 
lines (of a matrix) 229
Lobachevsky, N. I. 182
Lobachevsky-Graeffe method 179, 182 
for case of complex roots 187, 192 
for case of real and distinct roots 184 
logarithmic function 95 
logarithms 49 
loss of accuracy in subtraction 35 
Lyapin, E. S. 272, 676 
Lyusternik, L. A. 114, 453, 458, 676 
method of 453ff


Maclaurin 

Euler-Maclaurin formula 628, 630
Maclaurin’s series 89 
major approximation 19 
Maltsev, A. I. 272, 393, 676
mapping, contraction 487ff
Markov, A. 114, 228, 648, 676 
Markov chain, discrete 667 
Markov’s formula 566 
matrices (see matrix) 
matrix (matrices) 

absolute value of 242 

adjoint of 236 




bilinear expansion of 392 

bilinear form of 384 

bordered 257 

change-of-basis 348 

characteristic 376 

commutative 234 

conformal partitioned 257 

diagonal 229, 265 

difference of 231 

eigenvalues of 375 

eigenvectors of 375 

elementary transformations of 268 
equality of 230 

equivalent 268 

Frobenius 412, 415, 418 

inverse 236, 239 

inversion of (see inversion of matrices) 273
Jacobi 465, 470 
limit of 249 
minor of 248 
modulus of 242 
multiplication of 232 
by a scalar 231 
nonsingular 236 
norm of 242, 243 
nullity of 248 
operations involving 230 
orthogonal 350 
method of 363 
properties of 350 
orthogonalization of 351 
partitioned (see partitioned matrices) 
positive definite symmetric real 388
powers of 240 
product of 231 
of quadratic form 311
quasidiagonal 256 
rank of 248 
rational functions of 241 
real 389
rectangular 229 
series of (see matrix series) 251 
similar 339, 380 
singular 236 
square 229 
sum of 231
symmetric 235, 384 
trace of 377 
transformation 368 
transition probability 667 
transpose of 234 
triangular 265 
unit 230 
zero 230 
matrix algebra 229 
matrix inversion 236 
by partitioning 260 
using the coefficients of the characteristic
polynomial of a matrix
for 450
matrix power series, convergence of 394 
matrix series, absolutely convergent 
252 


mechanical cubature 590 
mechanical quadrature 590
mesh points 518 
method(s)
of A. A. Abramov 458 
of accumulation 306, 640 
of alternating sums 169 
Bernoulli’s 198ff
of bordering (matrices) 262 
of bounds 50 




of chords 122 
combination 136 
of A. M. Danilevsky 412 
computation of eigenvectors by 420 
computational scheme of 416 
exceptional cases in 418 
of double computation 50, 622 
for effectively checking the conditions 
of convergence 405 
escalator 320 
Euler-Abel 209 
exact (in solving systems of linear
equations) 273 
of exhaustion 443, 444 
Gaussian (see Gaussian methods) 277ff
of M. K. Gavurin 458
gradient 496 
halving 121 
interpolation 411 
of iteration 138ff, 300, 302, 484ff
iterative (in solving systems of linear 
equations) 273 
of L. V. Kantorovich (for isolating
singularities) 635 
of A. N. Krylov 217, 225, 411, 421ff
computation of eigenvectors by 424 
of Lagrange 169 


of Leverrier 411, 426 
Lin’s 201
of Lobachevsky-Graeffe 179, 182 
of L. A. Lyusternik 453 
modified Newton 135 
Monte Carlo 649ff
Newton 127, 171, 459ff
modified 481ff
nomographic 121 
of orthogonal matrices 363 
orthogonalization (see orthogonalization methods)
of N. V. Paluver 201 
of power series 504 
of principal elements 287 
of proportional parts 122 
of Purcell 320 
of relaxation 313ff 
of Richardson 320 
of scalar products (for finding first 
eigenvalue of a real matrix) 436 
Seidel 309ff
square-root 293 
of steepest descent 496, 499 
for a system of linear equations 501 
Sturm 173 
of successive approximations 138 
of tangents 127 
of undetermined coefficients 411, 428f 
Mikeladze, Sh. E. 589, 648, 676 
Milne, W. E. 458, 506, 573, 589, 648, 
674, 677 
minor (of a matrix) 248
minor approximation 19
Mlodzeyevsky, B. K. 202, 677 
modified Newton method 135 
modulus 242 
of a matrix 242 
Young’s 44 
Monte Carlo evaluation of multiple
integrals 656 
Monte Carlo method 649ff 
problems attacked by 650 
solving systems of linear algebraic
equations by 666 
Morrey, C. B. Jr. 506 
Morse, P. M. 674, 677 
multiple integrals, Monte Carlo evaluation of 656
multiplicity of a root 162 


nabla 497 
negative 
of a matrix 232 
of a vector 336 
negative definite quadratic form 312
Newton-Cotes formulas of higher orders 599
Newton-Cotes quadrature formulas 593


Newton-Leibniz formula 590 
Newton’s first interpolation formula
519, 522
Newton’s interpolation polynomial 521
Newton’s method 127, 171, 459ff
for complex roots 157
modified 135, 481ff
for a system of two equations 156 
Newton's process, convergence of 465, 
469 


rapidity of 474 
stability of 478 
Newton's quadrature formula 599 
Newton’s second interpolation formula 
526, 527 
Newton’s theorem 171
Nikolsky, S. M. 592, 648, 677 
nomographic methods 121
nonlinear equations (see systems of 
nonlinear equations) 459 
nonsingular matrix 236
nonsingular transformation 374 
nonterminating continued fractions 66
convergent 67
divergent 67 
norm of a matrix 242, 243 
canonical 243 
k-norm, l-norm, m-norm 243
normal linear system 312
normal system 311
normalized orthogonal basis 346
notation 
powers-of-ten 18, 24 
scientific 18, 24 
nullity of a matrix 248 
number(s)
approximate 19ff
Bernoulli 99, 208, 625ff
characteristic 375, 377 
exact 19 
pseudorandom 653 
random (see random numbers) 650 
numerical differentiation, formulas for 
equally spaced points 583 
numerical series 83
approximation of sums of 83 
convergent 83 


operator 
differentiation 629 
finite-difference 628
linear-transformation 369 
orthogonal basis 346 
orthogonal matrices 350 
method of 363 
properties of 350
orthogonal systems of vectors 345 
orthogonal vectors 345 
orthogonality 345 
orthogonalization 




of columns 358 
of matrices 351 
of rows 361
orthogonalization methods, application
of to solution of systems of linear
equations 358
orthonormal basis 346
Ostrowski, A. 465, 506, 677 
Ostrowski’s theorem 159, 465 


Paluver, N. V. 201 
method of 201 
parabolic interpolation 522 
parabolic rule 603 
partial derivatives, approximate cal- 
culation of 588 
partial quotients 56 
partitioned matrix (matrices) 256 
addition of 257 
conformal 257 
multiplication of 258 
subtraction of 257 
partitioning, matrix inversion by 260 
path 668 
periodicity class 214 
perpendicularity 345 
Perron, O. 76, 677 
Perron’s theorem 390 
point(s) 
interpolation (see interpolation points) 
518 


mesh 518 

of n-dimensional space 336 
polynomial(s)

characteristic 376, 382 

Chebyshev 554 

computing values of 77 

left 241 

Legendre 611

real roots of (number of) 173 

right 241

Taylor 89 
positive definite quadratic form 312 
positive definite symmetric real matrix 388


positive definiteness, property of 343 
postmultiplication 236 
postmultiplier 269 
power 
generalized 517 
relative error of 41 
power series 
accelerating convergence of (by Eu- 
ler-Abel method) 209 
method of 504 
powers-of-ten notation 18, 24 
premultiplication 236
premultiplier 269 
principal element(s) 287 
method of 287 
principal row 287 
principle 
of the argument 166 
of equal effects 45 
Pringsheim theorem 71
probabilities, transition 667
probability error estimate 52 
probability integral 561 
problem of interpolation (statement of) 
518 
procedure, direct and reverse 280 
process 
Hero’s 107 
Newton (see Newton process) 465 


root-squaring 182 

Seidel (see Seidel process) 155
product 

error of 37 

of matrices 231 

number of correct digits in a 39 

scalar (of vectors) 343 

properties of 343 

projection transformation 369 
propagation law of the e-error 515 
proper integral 633
proportional parts, method of 122 
pseudorandom numbers 653 
pseudorandom sequence 654 
Purcell, method of 320 


quadratic form 311

matrix of 311

negative definite 312 

positive definite 312 
quadrature, mechanical 590 
quadrature formulas 590 

accuracy of 618 

Chebyshev’s 607ff

of closed type 591

Gauss’ 611, 614 

Newton-Cotes 593

Newton’s 599 

of open type 591
quantity, random 651
quasidiagonal matrices 256 
quotient(s) 

error of 41

number of correct digits in a 41

partial 56 


random numbers 650 
generating 653 
tables of 653 
random quantity 651
random sequence 651
Gaussian 652
randomizing devices 653
rank of a matrix 248 


rational fractions, computing values of 
82 


real matrices 389 
reciprocals 

computing 104 

of square roots, computing 113 
rectangular matrix 229 
relations, biorthogonality 390, 393
relative error 19, 21 


of approximate number and number
of correct digits 27
of a power 41
of a root 41
relaxation, method of 313ff
remainder term 89 


of Bessel’s interpolation formula 552 


Simpson’s formula and its 596 


of Stirling’s interpolation formula 552 


trapezoidal formula and its 595 
Remez, E. Ya. 573, 677 
residual 285 
residual error 23 


reverse procedure (see back substitution) 280
Richardson, method of 320 
Richardson extrapolation 622ff 
Rabinowitz, P. 674, 675 
Robinson, G. 161, 272, 573, 678 




Rolle’s theorem 133 
root(s)
characteristic 375 
complex (see complex roots) 190 
of an equation 115 
existence of (of a system) 469 
finding (by inverse interpolation) 564 
‘fold (s-fold) 178 
improving 284 
isolation of 115 
latent 375 
multiplicity of 162 
real (bounds of) 167 
real (of a polynomial) 173 
relative error of 41
separated 180 
of a system of equations 274 
root squaring 182
root-squaring process 182 
rotation transformation 370 
rounding of numbers 26ff
rounding errors 29 
rounding-off rule 26 


row vector 229
rule
Cramer’s 273 
even-digit 26 
parabolic 603 
rounding-off 26
of signs, Descartes’ 178 
Simpson’s 597 
three-eighths 599
trapezoidal 601

Runge, C. 589, 677 


Salekhov, G. 228, 677 
Salvadori, M. G. 321, 648, 677 
sampling error 666 

scalar 229 

scalar product(s) 


method of 436 
of vectors 343 
properties of 343 


Scarborough, J. B. 54, 161, 321, 506,
573, 589, 648, 677


scheme
Horner’s 77, 78
generalized 80-81
of Khaletsky 295, 297
of unique division 280


Schreier, O. 272, 677 
scientific notation 18, 24 
secular determinant(s) 376, 565 


expansion of 410ff
comparison of different methods of


secular equation 376 
Seidel method 309ff
Seidel process 155 


convergence of 327, 330, 333 

estimating the error of approximations in 330
by l-norm 332
by m-norm 330


separated roots 180 
sequence 


convergent (of matrices) 249
Gaussian random 652 
pseudorandom 654 


random 651
Sturm 174 
series 


convergence of, acceleration of 89 
Fourier trigonometric 213 




Maclaurin’s 89 
matrix, sum of 251
numerical 83 
convergent 83 
power (see method of power series) 504 


sum of 83 
Taylor's 89 
sign 86 


Shabat, B. V. 202, 675 
Shapiro, G. M. 202, 677
Shestakov, V. I. 114, 676 
Shilov, G. E. A. 393, 677 
Shreider, Yu. A. 393, 674, 677 
Shura-Bura, M. R. 114, 676 
sign(s) 
Descartes’ rule of 178 
variations of (see variation of sign) 174 
sign changes 174 
significant digit 24 
signum function 86 
similar matrices 339, 380 
symbol for 339, 380
simple continued fraction 56 
Simpson type cubature formula 644
Simpson’s cubature formula 645 
Simpson’s formula and its remainder 
term 596 
Simpson’s general formula 603
Simpson's rule, formula of 597 
singular linear transformation 374 
singular matrix 236 
singular transformation 375 
singularities, isolation of by Kantorovich method 635
Smirnov, V. I. 114, 409, 677 
Smolitsky, Kh. L. 321, 677 
solution(s) 
approximate (of systems of nonlinear 
equations) 459 
of a difference equation 198 
fundamental system of 365 
graphical 119 
solution set (of a system of equations)
274 


solution space of a homogeneous system 
364f 


space 
basis of 340 
generated by vectors 342 
linear vector 336ff 
n-dimensional 336 
point of 336 
vector of 336 
n-dimensional real 344 
solutions 364 
spanned by vectors 342 
spacing (see interval) 507
span (spanned by vectors) 342 
Sperner, E. 272, 677 
square matrix 229 
square roots, computing 107 
square-root method 293 
standard continued fraction 56 
steepest descent, method of 496, 499
for a system of linear equations 


Steffensen, J. F. 599, 648, 677 

Stenin 465 

Stirling, J. 205 

Stirling’s expansion 207 

Stirling’s interpolation formula 533 
Sturm method 173 

Sturm sequence 174 

Sturm’s theorem 174 




submatrix (submatrices) 256
subspace 
dimensionality of 342 
linear (see linear subspace) 344 
substitution 
back 280 
forward 280 
subtraction, loss of accuracy in 35 


sum
error of 33
of a series 83 

Sylvester conditions 389 

symmetric matrix (matrices) 235, 384 
properties of 384 

symmetry, Hermitian 343 

system(s) 
fundamental (of solutions) 365 
linear (see linear system) 
of linear algebraic equations, solution
of by Monte Carlo method


of linear equations 
solution of by orthogonalization 
methods 358 
solving 273 
of nonlinear equations, approximate 
solution of 459ff
normal 311
normal linear 312
orthogonal 345 


table(s) 
Barlow’s 113 
of central differences 530 
for determining limiting relative error: 


difference (see difference table) 510 
of divided differences 556 
double-entry 569
of random numbers 653 
compilation of 653, 654, 656 
tangents, method of 127 
Taylor polynomial 89
Taylor’s series 89 
term of nth component of continued
fraction 55 
remainder 89 
Ter-Mikaelyan, T.M. 114, 676 
test, Cauchy’s 83 
theorem
of Budan-Fourier 175, 176 
convergence 214 
Cayley-Hamilton 397 
fundamental (of algebra) 162 
Hua’s 179 
Hurwitz 406 
Kantorovich 465 
Lagrange’s 168 
Newton’s 171
Ostrowski’s 159, 465 
Perron’s 390 
of Pringsheim 71 
Rolle’s 133 
Sturm’s 174 
Weierstrass 389 
theory 
of continued fractions 55ff 
of errors 41
inverse problem of 44 
second 48 




three-eighths rule 599 
Tolstoy, G. P. 114, 161, 228, 677 
trace of a matrix 377 
trajectory 668 
transformation(s) 367 
of coordinates of a vector under changes
in the basis 348
Euler-Abel 210 
identical 375 
inverse 373 
Kummer 203, 204 
linear (of variables) 367 
degenerate 374 
operating with 371 
singular 374 


by a matrix, properties of 381
nonsingular 374 

projection 369 

rotation 370 

singular 375 

transformation matrix 368 

transition probabilities 667 

transition probability matrix 667-668 

transpose of a matrix 234 
properties of 235 

trapezoidal formula and its remainder
term 595

trapezoidal rule 601

triangular matrices 265 

trigonometric approximation 225 

trigonometric functions 49, 98 
computing values of 98 


undetermined coefficients, method of 411 
unique division, scheme of 280 
unit matrix 230 


Vallée-Poussin, C. J. de la 228, 677 
Vandermonde determinant 592, 614 
variable interval 555 




variations of sign 174 


lower number of 175 
upper number of 176


vector(s)


angle between two 345 
column 229 
coordinates of 336, 341
length of 345 
linear dependence of 337 
linearly dependent 337 
linearly independent 337
of n-dimensional space 336 
negative of 336 
orthogonal 345 
orthogonal systems of 345 
product of 337 
product of by a scalar 337 
row 229
scalar product of 343 
properties of 343 
zero 336 
Ventsel, D. A. 54, 161, 677 
Ventsel, E. S. 54, 161, 506, 674, 677
very much less than (≪) 22
volume, algebraic 656 


Wayland, Harold 458, 678 
Weierstrass theorem 389 

Whittaker, E. T. 161, 272, 573, 678 
Willers, F. A. 465


Young’s modulus 44 


Zaguskin, V. L. 202, 678 

zero of a function 115 

zero matrix 230 

zero vector 336 

Zhidkov, N. P. 321, 458, 648, 675 


Mir Publishers would be grateful for your comments on the
content, translation and design of this book.
We would also be pleased to receive any other suggestions
you may wish to make.
Our address is: 
Mir Publishers 


2 Pervy Rizhsky Pereulok 
I-110, GSP, Moscow, 129820


USSR 


Printed in the Union of Soviet Socialist Republics 


